Abstract



In this dissertation we develop a new formal graphical framework for 
causal reasoning. Starting with a review of monoidal categories and their 
associated graphical languages, we then revisit probability theory from 
a categorical perspective and introduce Bayesian networks, an existing 
structure for describing causal relationships. Motivated by these, we propose
a new algebraic structure, which we term a causal theory. These take
the form of a symmetric monoidal category, with the objects representing
variables and the morphisms ways of deducing information about one variable
from another. A major advantage of reasoning with these structures
is that the resulting graphical representations of morphisms match well 
with intuitions for flows of information between these variables. These 
categories can then be modelled in other categories, providing concrete 
interpretations for the variables and morphisms. In particular, we shall 
see that models in the category of measurable spaces and stochastic maps 
provide a slight generalisation of Bayesian networks, and naturally form a 
category themselves. We conclude with a discussion of this category, classifying
the morphisms and discussing some basic universal constructions.

Introduction



From riding a bicycle to buying flowers for a friend, causal relationships form a basic 
and ubiquitous framework informing, at least informally, how we organise, reason 
about, and choose to interact with the world. It is perhaps surprising then that 
ideas of causality often are entirely absent from our formal scientific frameworks, 
whether they directly be models of the world, such as theories of physics, or methods 
for extracting information from data, as in the case of statistical techniques. It is 
the belief of the author that there remains much to be gained from formalising our 
intuitions regarding causality. 

Indeed, taking the view that causal relationships are fundamental physical facts 
about the world, it is interesting to discover and discuss these in their own right. Even 
if one views causality only as a convenient way to organise information about
dependencies between variables, however, it is hard to see how introducing such notions
into formal theories, rather than simply ignoring these intuitions, will not benefit at
least some of them. The artificial intelligence community gives a tangible example of
this, with the widespread use of Bayesian networks indicating that causal relationships
provide a far more efficient way to encode, update, and reason with information
about random variables than simply working with the entire joint variable.

In what follows we lay out the beginnings of a formal framework for reasoning 
about causality. We shall do this by extending the aforementioned existing ideas for 
describing causal relationships between random variables through the use of category 
theory, and in particular the theory of monoidal categories. 

Overview of the literature 

More precisely, in this dissertation we aim to bridge three distinct ideas. The first is 
the understanding of probability theory and probabilistic processes from a categorical 
perspective. For this we work with a category first defined by Lawvere a half-century 
ago in the unpublished manuscript [13], in which the objects are sets equipped with 
a $\sigma$-algebra and the morphisms specify a measure on the codomain for each element
of the domain, subject to a regularity condition. These ideas were later developed, in
1982, in a short paper by Giry [9], and have been further explored by Doberkat [6], 
Panangaden [15], and Wendt [25], among others, in recent years. 

The second idea is that of using graphical models to depict causal relationships. 
Termed Bayesian networks, these were first discussed predominantly in the machine 
learning community in the 1980s, with the seminal work coming in the influential 
book Pearl [18]. Since then Bayesian networks have been used extensively to discuss
causality from both computational and philosophical perspectives, as can be seen in
the recent books Pearl [19] and Williamson [26].

The third body of work, which will serve as a framework to unite the above two 
ideas, is the theory of monoidal categories and their graphical calculi. An introductory 
exposition of monoidal categories can be found in Mac Lane [14], while the survey
by Selinger [20] provides an excellent overview of the graphical ideas. Here our work
is in particular influenced by the very recent paper by Coecke and Spekkens
[1], which uses monoidal categories to picture Bayesian inference, realising Bayesian 
inversion as a compact structure on an appropriate category. 

Outline 

Since they serve as the underlying framework for this thesis, we begin with a chapter 
reviewing the theory of monoidal categories, the last of the above ideas. We conclude 
this first chapter by discussing how the idea of a monoid can be generalised through 
a category we call the 'theory of monoids', with monoids themselves being realised 
as monoidal functors from this category into the category Set of sets and functions, 
while 'generalised monoids' take the form of monoidal functors from the theory of 
monoids into other categories. It is in this sense that the 'causal theories' we will
define are theories. In Chapter 2 we turn our attention to reviewing the basic ideas
of measure-theoretic probability theory from a categorical viewpoint. Here we pay
particular attention to the category, which we shall call Stoch, defined by Lawvere.
Chapter 3 then provides some background on Bayesian networks, stating a few results 
about how they capture causal relationships between random variables through ideas 
of conditional independence. This motivates the definition of a causal theory, which 
we present in Chapter 4. Following our exploration of these categories and how to 
represent their morphisms graphically, we turn our attention to their models. Although
models in Set and Rel are interesting, we spend most of the time discussing
models in Lawvere's category Stoch. This chapter is concluded by a discussion of
confounding variables and Simpson's paradox, where we see some of the strengths of
causal theories and in particular their graphical languages. In short, we show we can
take the directed graph structure of a Bayesian network more seriously than just a 
suggestive depiction of dependencies. In the final chapter, Chapter 5, we discuss some 
properties of the category of stochastic causal models of a fixed causal theory. These 
are models in a certain full subcategory of Stoch that omits various pathological 
eventuations, and have a close relationship with Bayesian networks. 

New contributions 

The main contribution of this dissertation is the presentation of a new algebraic 
structure: causal theories. These are a type of symmetric monoidal category, and we 
will discuss how these capture the notion of causal relationships between variables; 
deterministic, possibilistic and probabilistic models of these categories; how their 
graphical calculi provide intuitive representations of reasoning and information flow; 
and some of the structure of the category of probabilistic models. In doing so we 
also move the discussion of Bayesian networks from the finite setting preferred by 
the computationally-focussed Bayesian network community to a more general setting 
capable of handling non-discrete probability spaces. 

In particular, I claim all results of Chapters 4 and 5 as my own as well as, except
for Proposition 2.17, the discussion of deterministic stochastic maps of Section 2.4.

Chapter 1

Preliminaries on Monoidal 
Categories 

Our aim is to explore representations of causal relationships between random variables 
from a categorical perspective. In this chapter we lay the foundations for this by 
introducing the basic language we will be working with — the language of monoidal 
categories — and, by way of example, informally discussing the notion of a 'theory'. 

We begin with a review of the relevant notions from category theory. Recall that
a category $\mathcal{C}$ consists of a collection $\mathrm{Ob}\,\mathcal{C}$ of objects, for each pair $A, B$ of objects a
set $\mathrm{Mor}(A, B)$ of morphisms, and for each triple $A, B, C$ of objects a function, or
composition rule, $\mathrm{Mor}(A, B) \times \mathrm{Mor}(B, C) \to \mathrm{Mor}(A, C)$, such that the composition
rule is associative and obeys a unit law. We shall write $A \in \mathcal{C}$ if $A$ is an object of
the category $\mathcal{C}$, and $f$ in $\mathcal{C}$ if $f$ is a morphism in the category $\mathcal{C}$. As we shall think
of them, categories are the basic algebraic structure capturing the idea of composable
processes, with the objects of a category representing different systems of a given type, and the
morphisms representing processes transforming one system into another.

We further remind ourselves that a functor is a map from one category to another 
preserving the composition rule, that under a mild size constraint the collection of 
categories itself forms a category with functors as morphisms, and that in this category 
products exist. Moreover, the set of functors between any two categories itself has the 
structure of a category in a standard, nontrivial way, and we call the morphisms in 
this category natural transformations, with the invertible ones further called natural 
isomorphisms. Two categories are equivalent if there exists a functor in each direction 
between the two such that their compositions in both orders are naturally isomorphic 
to the identity functor.

The reader seeking more detail is referred to Mac Lane [14], in particular Chapters
I and II. In general our terminology and notation for categories will follow the
conventions set out there. 

1.1 Monoidal categories 

The key structure of interest to us in the following is that of a symmetric monoidal
category. A monoidal category is a category with two notions of composition, ordinary
categorical composition and the monoidal composition, and symmetric monoidal
categories may be thought of as the algebraic structure of processes that may occur
simultaneously as well as sequentially. These categories are of special interest as they
may be described precisely with a graphical notation possessing a logic that agrees
well with natural topological intuitions. Among other things, this has been used to
great effect in describing quantum protocols by Abramsky and Coecke [2], and this
work in part motivates that presented here.

Definition 1.1 (Monoidal category). A monoidal category $(\mathcal{C}, \otimes, I, \alpha, \rho, \lambda)$ consists
of a category $\mathcal{C}$, together with a functor $\otimes : \mathcal{C} \times \mathcal{C} \to \mathcal{C}$, a distinguished object $I \in \mathcal{C}$,
for all objects $A, B, C \in \mathcal{C}$ isomorphisms $\alpha_{A,B,C} : (A \otimes B) \otimes C \to A \otimes (B \otimes C)$ in
$\mathcal{C}$ natural in $A, B, C$, and for all objects $A \in \mathcal{C}$ isomorphisms $\rho_A : A \otimes I \to A$ and
$\lambda_A : I \otimes A \to A$ in $\mathcal{C}$ natural in $A$. To form a monoidal category, this data is subject
to two equations: the pentagon equation




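\[
(\mathrm{id}_A \otimes \alpha_{B,C,D}) \circ \alpha_{A,B \otimes C,D} \circ (\alpha_{A,B,C} \otimes \mathrm{id}_D)
= \alpha_{A,B,C \otimes D} \circ \alpha_{A \otimes B,C,D}
: ((A \otimes B) \otimes C) \otimes D \to A \otimes (B \otimes (C \otimes D))
\]

and the triangle equation

\[
(\mathrm{id}_A \otimes \lambda_B) \circ \alpha_{A,I,B} = \rho_A \otimes \mathrm{id}_B
: (A \otimes I) \otimes B \to A \otimes B.
\]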




We call $\otimes$ the monoidal product, $I$ the monoidal unit, the isomorphisms $\alpha, \alpha^{-1}$
associators, and the isomorphisms $\rho, \rho^{-1}$ and $\lambda, \lambda^{-1}$ right- and left-unitors respectively.
Collectively, we call the associators and unitors the structure maps of our monoidal
category. We will often just write $\mathcal{C}$ for a monoidal category $(\mathcal{C}, \otimes, I, \alpha, \rho, \lambda)$, leaving
the remaining data implicit.

The associators express the fact that the product objects $(A \otimes B) \otimes C$ and
$A \otimes (B \otimes C)$ are in some sense the same (they are isomorphic via some canonical
isomorphism), while the unitors express the fact that $A$, $A \otimes I$, and $I \otimes A$ are the
same. If these objects are in fact equal, and the structure maps are simply identity
maps, then we say that our monoidal category is a strict monoidal category. In this
case any two objects that can be related by structure maps are equal, and so
we may write objects without parentheses and units without ambiguity. Although,
importantly, this is not true in all cases, it is essentially true: loosely speaking,
the triangle and pentagon equations in fact imply that any diagram of their general
kind, expressing composites of structure maps between different ways of forming the
monoidal product of some objects, commutes. This is known as Mac Lane's coherence
theorem for monoidal categories; see Mac Lane [14, Corollary of Theorem VII.2.1] for
a precise statement and proof.

In a monoidal category, the objects $A \otimes B$ and $B \otimes A$ need not in general be
related in any way. In the cases that will interest us, however, we will not want
the order in which we write the objects in a tensor product to matter: all products
consisting of a given collection of objects should be isomorphic, and isomorphic in such a
way that we need not worry about the isomorphism itself. This additional requirement
turns a monoidal category into a symmetric monoidal category.

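Definition 1.2 (Symmetric monoidal category). A symmetric monoidal category
$(\mathcal{C}, \otimes, I, \alpha, \rho, \lambda, \sigma)$ consists of a monoidal category $(\mathcal{C}, \otimes, I, \alpha, \rho, \lambda)$ together with a
collection of isomorphisms $\sigma_{A,B} : A \otimes B \to B \otimes A$ natural in $A$ and $B$ such that
$\sigma_{B,A} \circ \sigma_{A,B} = \mathrm{id}_{A \otimes B}$ and such that for all objects $A, B, C$ the hexagon

\[
\alpha_{B,C,A} \circ \sigma_{A,B \otimes C} \circ \alpha_{A,B,C}
= (\mathrm{id}_B \otimes \sigma_{A,C}) \circ \alpha_{B,A,C} \circ (\sigma_{A,B} \otimes \mathrm{id}_C)
: (A \otimes B) \otimes C \to B \otimes (C \otimes A)
\]

commutes.

We call the isomorphisms $\sigma_{-,-}$ swaps.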

As for monoidal categories, we have a coherence theorem for symmetric monoidal
categories, stating in essence that all diagrams composed of identities, associators,
unitors, and swaps commute. Details can again be found in Mac Lane [14, Theorem
XI.1.1].

Examples 1.3 (FVect, Mat). A historically important example of a symmetric
monoidal category is FVect$_{\mathbb{R}}$, the category of finite-dimensional vector spaces over $\mathbb{R}$
with linear maps as morphisms and tensor product as monoidal product. Here $\mathbb{R}$ is the
monoidal unit, and the structure maps are the obvious isomorphisms of tensor products
of vector spaces. Note that this is not a strict symmetric monoidal category: it is
not true for real vector spaces $U, V, W$ that we consider $(U \otimes V) \otimes W$ and $U \otimes (V \otimes W)$
as equal, but we do always have a canonical isomorphism between the two.

A related strict symmetric monoidal category is Mat($\mathbb{R}$), the category with objects
natural numbers, morphisms from $m \in \mathbb{N}$ to $n \in \mathbb{N}$ given by $n \times m$ matrices
over $\mathbb{R}$, composition given by multiplication of matrices, and monoidal product given by
multiplication on objects and Kronecker product of matrices on morphisms.

Example 1.4 (Set). The category Set of sets and functions forms a symmetric monoidal
category with the cartesian product $\times$. In this category any singleton set $\{*\}$ may be
taken as the monoidal unit. Indeed, any category with finite products can be viewed
as a symmetric monoidal category by taking the binary categorical product as the
monoidal product, and the terminal object as the monoidal unit. The associators, unitors,
and swaps are then specified by the unique isomorphisms given by the universal
property of the product.

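As a rough illustration of the category Mat($\mathbb{R}$) of Examples 1.3, the following Python sketch represents morphisms as nested lists and checks, on one hand-picked instance, that the Kronecker product is compatible with matrix multiplication, which is what lets it serve as a monoidal product on morphisms. The helper names `matmul`, `kron` and `identity` and the sample matrices are our own assumptions rather than notation from the text, and integer entries are used only to keep the arithmetic exact.

```python
def matmul(a, b):
    """Composite of morphisms, i.e. the ordinary matrix product a b."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def kron(a, b):
    """Kronecker product of a and b (monoidal product on morphisms)."""
    return [[a[i][j] * b[p][q]
             for j in range(len(a[0])) for q in range(len(b[0]))]
            for i in range(len(a)) for p in range(len(b))]


def identity(n):
    """Identity morphism on the object n."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


# A morphism m -> n is an n x m matrix (hypothetical sample data).
h = [[1, 0], [2, 1], [0, 3]]   # h : 2 -> 3
f = [[1, 2, 3]]                # f : 3 -> 1
k = [[4], [5]]                 # k : 1 -> 2
g = [[1, 1], [0, 2]]           # g : 2 -> 2

# Interchange law: (f h) (x) (g k) equals (f (x) g)(h (x) k).
lhs = kron(matmul(f, h), matmul(g, k))
rhs = matmul(kron(f, g), kron(h, k))
assert lhs == rhs

# Strictness on this instance: the product is associative on the nose,
# and identities multiply as the objects do (2 * 3 = 6).
assert kron(kron(f, g), k) == kron(f, kron(g, k))
assert kron(identity(2), identity(3)) == identity(6)
```

A single instance is of course not a proof; the sketch only shows what the equalities assert concretely.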

Example 1.5 (Rel). The category Rel of sets and relations forms a symmetric monoidal
category with cartesian product $\times$ and unit $\{*\}$. Here the monoidal product
$r \otimes s : X \times Z \to Y \times W$ of relations $r : X \to Y$ and $s : Z \to W$ is the relation such that
$(x, z) \in X \times Z$ is related to $(y, w) \in Y \times W$ if and only if $x$ is related to $y$ by $r$ and
$z$ is related to $w$ by $s$.

Intuitively, the standard embedding of Set into Rel, given by viewing functions 
as relations, is an embedding that respects the monoidal structure. To make this 
precise we need to talk about monoidal functors. 
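Before making this precise, the intuition can be sketched for finite sets in Python. Relations are modelled as sets of pairs, and the check confirms on sample data that viewing functions as relations respects both composition and the monoidal product. The names `graph`, `compose` and `tensor` and the sample functions are hypothetical choices made for this illustration.

```python
def graph(f, domain):
    """View a function on a finite domain as a relation (a set of pairs)."""
    return {(x, f(x)) for x in domain}


def compose(s, r):
    """Relational composite 's after r': x ~ z iff x r y and y s z for some y."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}


def tensor(r, s):
    """Monoidal product: (x, z) ~ (y, w) iff x r y and z s w."""
    return {((x, z), (y, w)) for (x, y) in r for (z, w) in s}


# Hypothetical sample data: functions on small finite sets.
X, Y, Z = {0, 1, 2}, {0, 1}, {"a", "b"}
f = lambda x: x % 2          # f : X -> Y
g = lambda z: z.upper()      # g : Z -> {"A", "B"}
u = lambda y: y + 10         # u : Y -> {10, 11}

# The cartesian product of functions, f x g : X x Z -> Y x W.
f_times_g = lambda p: (f(p[0]), g(p[1]))
XZ = {(x, z) for x in X for z in Z}

# Viewing functions as relations sends f x g to graph(f) (x) graph(g) ...
assert graph(f_times_g, XZ) == tensor(graph(f, X), graph(g, Z))
# ... and sends composites of functions to composites of relations.
assert graph(lambda x: u(f(x)), X) == compose(graph(u, Y), graph(f, X))
```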

When working with monoidal categories, it is often desirable to have functors
between these categories preserve the monoidal structure, and to have natural transformations
between these functors preserve the monoidal structure too. The same is
true in the case of functors between symmetric monoidal categories. We thus introduce
the notions of monoidal functors, symmetric monoidal functors, and monoidal
natural transformations.

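Definition 1.6 (Monoidal functor). Let $\mathcal{C}, \mathcal{C}'$ be monoidal categories. A monoidal
functor $(F, F_\otimes, F_*) : \mathcal{C} \to \mathcal{C}'$ from $\mathcal{C}$ to $\mathcal{C}'$ consists of a functor $F : \mathcal{C} \to \mathcal{C}'$, for all
objects $A, B \in \mathcal{C}$ morphisms

\[
F_{\otimes,A,B} : F(A) \otimes F(B) \to F(A \otimes B)
\]

in $\mathcal{C}'$ which are natural in $A$ and $B$, and for the units $I$ of $\mathcal{C}$ and $I'$ of $\mathcal{C}'$ a morphism
$F_* : I' \to F(I)$ in $\mathcal{C}'$, such that for all $A, B, C \in \mathcal{C}$ the hexagon

\[
F(\alpha_{A,B,C}) \circ F_{\otimes,A \otimes B,C} \circ (F_{\otimes,A,B} \otimes \mathrm{id}_{FC})
= F_{\otimes,A,B \otimes C} \circ (\mathrm{id}_{FA} \otimes F_{\otimes,B,C}) \circ \alpha'_{FA,FB,FC}
: (FA \otimes FB) \otimes FC \to F(A \otimes (B \otimes C))
\]

and the two squares

\[
F(\rho_A) \circ F_{\otimes,A,I} \circ (\mathrm{id}_{FA} \otimes F_*) = \rho'_{FA} : FA \otimes I' \to FA,
\qquad
F(\lambda_A) \circ F_{\otimes,I,A} \circ (F_* \otimes \mathrm{id}_{FA}) = \lambda'_{FA} : I' \otimes FA \to FA
\]

commute. Here $\alpha', \rho', \lambda'$ denote the structure maps of $\mathcal{C}'$.

We further say a monoidal functor is a strong monoidal functor if the morphisms
$F_{\otimes,A,B}$ and $F_*$ are isomorphisms for all $A, B \in \mathcal{C}$.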

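Definition 1.7 (Symmetric monoidal functor). A symmetric monoidal functor
$(F, F_\otimes, F_*) : \mathcal{C} \to \mathcal{C}'$ between symmetric monoidal categories $\mathcal{C}$ and $\mathcal{C}'$ is a monoidal
functor $(F, F_\otimes, F_*)$ such that the square

\[
F(\sigma_{A,B}) \circ F_{\otimes,A,B} = F_{\otimes,B,A} \circ \sigma'_{FA,FB} : FA \otimes FB \to F(B \otimes A)
\]

commutes for all $A, B \in \mathcal{C}$, where $\sigma'$ denotes the swap of $\mathcal{C}'$.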



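Definition 1.8 (Monoidal natural transformation). A monoidal natural transformation
$\theta : F \Rightarrow G$ between two monoidal functors $F$ and $G$ is a natural transformation
$\theta : F \Rightarrow G$ such that the triangle

\[
\theta_I \circ F_* = G_* : I' \to GI \tag{MNT1}
\]

and square

\[
\theta_{A \otimes B} \circ F_{\otimes,A,B} = G_{\otimes,A,B} \circ (\theta_A \otimes \theta_B) : FA \otimes FB \to G(A \otimes B) \tag{MNT2}
\]

commute for all objects $A, B$.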



Example 1.9. FVect$_{\mathbb{R}}$ and Mat($\mathbb{R}$) are equivalent via strong monoidal functors. It is
a corollary of Mac Lane's coherence theorems that any monoidal category can be
'strictified' (that is, for any monoidal category there exists a strict monoidal category
equivalent to it via strong monoidal functors) and that a symmetric monoidal category
can be strictified into a strict symmetric monoidal category. For more details
see [14, Theorem XI.3.1].

1.2 Graphical calculi 

One of the draws of expressing concepts as symmetric monoidal categories is that 
the structure of these categories naturally lends itself to being expressed pictorially. 
These pictures, known as string diagrams, represent the morphisms of a monoidal 
category, and have the benefit of hiding certain structural equalities and making use 
of our topological intuitions to suggest other important equalities. The aim of this 
section is merely to give the reader a basic working understanding of how to read and 
draw these diagrams; we leave the precise definition of a string diagram and proofs 
of their expressiveness to the survey [20] of Selinger.

String diagrams are drawn in two dimensions with, roughly speaking, one dimension
representing the categorical composition and the other representing monoidal
composition. We shall take the convention, common but far from universal, that 
we read composition up the page, leaving horizontal juxtaposition to represent the 
monoidal product of maps. Under this convention then, a string diagram consists of 
a graph with edges labelled by objects and vertices labelled by morphisms, which as 
a whole represents a morphism with domain the monoidal product of the edges at 
the lower end of the diagram, and codomain the monoidal product of the edges at 
the top. 

The simplest example, consisting of just a single edge, represents the identity map:

[Diagram: a single vertical edge labelled $X$, representing $\mathrm{id}_X$.]

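More generally, we represent a morphism $f : X \to Y$ by drawing in sequence up the
page an edge labelled by $X$, ending at a vertex labelled by $f$, which then gives rise
to an edge labelled by $Y$:

[Diagram: an edge labelled $X$ entering a vertex $f$ from below, with an edge labelled $Y$ leaving it above.]

If $X = A \otimes B \otimes C$ and $Y = D \otimes E$, we could also represent $f$ as:

[Diagram: edges labelled $A$, $B$, $C$ entering a vertex $f$ from below, with edges labelled $D$, $E$ leaving it above.]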



Given maps $f : X \to Y$ and $g : Y \to Z$, we represent their composite $g \circ f$ by
placing a vertex $g$ on the $Y$-edge leaving the vertex $f$:

[Diagram: an edge labelled $X$ entering a vertex $f$, joined by an unlabelled edge to a vertex $g$ above it, with an edge labelled $Z$ leaving $g$.]

If the types of the maps are known, we lose no information if we omit the labels
of edges that are connected to the vertices of the maps, as we have done for the edge
representing $Y$ in the above diagram. For the sake of cleanness and readability, we
shall most often just label the 'input' and 'output' edges at the top and bottom of 
the diagram. 

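The monoidal product of two maps is given by their horizontal juxtaposition, with
juxtaposition on the right representing monoidal product on the right, and on the left
representing monoidal product on the left. As an example, given morphisms $f : X \to Y$ and
$g : Z \to W$, we write their product $f \otimes g : X \otimes Z \to Y \otimes W$ as:

[Diagram: a vertex $f$ with input edge $X$ and output edge $Y$, drawn to the left of a vertex $g$ with input edge $Z$ and output edge $W$.]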



The monoidal unit is an object with special properties in the category, and as a result
the conventions for representing the unit diagrammatically are a little different: we
don't draw it or its identity map:

[Diagram: $\mathrm{id}_I$ is represented by the empty diagram.]

This has the advantage of any diagram representing a morphism $f : A \to B$ also
representing the 'equivalent' morphism $f \otimes \mathrm{id}_I : A \otimes I \to B \otimes I$, among other
equivalent morphisms.

To read an arbitrary string diagram, it is often easiest to start at the lower edge
and move up the diagram, reading off a morphism for every horizontal cross-section
intersecting a vertex. The string diagram then represents the composite of these
morphisms in the order that the morphisms were read, applying associators and
unitors as needed for the map to be well-defined. For example, reading in this way
the diagram

[Diagram: vertices $h$ and $k$ side by side on the lower level, with $f$ placed above $h$ and $g$ placed above $k$.]

represents the map $(f \otimes g) \circ (h \otimes k)$. Note that it may also be read $(f \circ h) \otimes (g \circ k)$,
or even $(f \circ \mathrm{id}_X \circ h) \otimes (g \circ k) \otimes \mathrm{id}_I$ where $X$ is the domain of $f$, but in any case
all these different algebraic descriptions of the picture represent the same morphism.
This is a key feature of string diagrams: many equalities of algebraic representations
of morphisms become just the identity of diagrams. Furthermore, we need not be
too careful about the precise geometry of the diagrams; the following topologically
equivalent diagrams in fact also express equal morphisms:

[Diagrams: several planar deformations of the diagram above, with the vertices $f$, $g$, $h$, $k$ drawn at different relative heights.]



This holds true in general. 

Theorem 1.10 (Coherence of the graphical calculus for monoidal categories). Two
morphisms in a monoidal category are equal with their equality following from the
axioms of monoidal categories if and only if their diagrams are equal up to planar
deformation.

Proof. Joyal-Street [11, Theorem 1.2]. □

In a symmetric monoidal category, we usually omit the label for the swap, denoting
it instead just by the intersection of two strings:

[Diagram: the swap $\sigma_{A,B}$, drawn as two strings crossing, with inputs $A$, $B$ at the bottom and outputs $B$, $A$ at the top.]




We will also later take such an approach for other chosen maps, such as the multiplication
and unit of a monoid.

The defining identities of the swap may then be written graphically as

[Diagram (Sym1): the graphical form of $\sigma_{B,A} \circ \sigma_{A,B} = \mathrm{id}_{A \otimes B}$.]

and

[Diagram (Sym2): the graphical form of the hexagon identity, with inputs $A$, $B$, $C$ and outputs $B$, $C$, $A$ on both sides.]

Including these identities in our collection of allowable transformations of diagrams
gives a coherence theorem for symmetric monoidal categories.

Theorem 1.11 (Coherence of the graphical calculus for symmetric monoidal categories).
Two morphisms in a symmetric monoidal category are equal with their equality
following from the axioms of symmetric monoidal categories if and only if their
diagrams are equal up to planar deformation and local applications of the identities
(Sym1) and (Sym2).

Proof. Joyal-Street [11, Theorem 2.3]. □



Just as two diagrams represent the same morphism in a monoidal category if they
agree up to planar isotopy, this theorem may be regarded geometrically as stating
that two diagrams represent the same morphism in a symmetric monoidal category if
they agree up to isotopy in four dimensions.






These two theorems show that the graphical calculi go beyond visualisations of the
morphisms, having the ability to provide bona fide proofs of equalities of morphisms.
As a general principle, one which we shall demonstrate in this dissertation, this fact
combined with the intuitiveness of manipulations and the encoding of certain equalities
and structural isomorphisms makes string diagrams better than the conventional
algebraic language for understanding monoidal categories.

1.3 Example: the theory of monoids 

This section serves to both give examples of the constructions defined in this chapter
and, more importantly, give a flavour of the spirit in which we will aim to use monoidal
categories to discuss causality.

Recall that a monoid is a set with an associative, unital binary operation. We 
shall classify these as strong monoidal functors from a category Th(Mon) into Set, 
and hence say that this category Th(Mon) describes the theory of monoids. The 
study of this category and its functorial images then gives new and interesting perspectives
on the concept of a monoid, its generalisations, and relationships to other
mathematical structures. In analogy to this, we will later define causal theories as 
monoidal categories that can be modelled within other categories through monoidal 
functors. 

Define the category Th(Mon) as follows: fix some symbol $M$, and let the objects
of Th(Mon) be any natural number of copies of this symbol. We shall write the
objects $M^{\otimes n}$, where $n \in \mathbb{N}$ is the number of copies of $M$. Then the monoidal product
on the objects of Th(Mon) is just addition of number of copies of $M$, with $I = M^{\otimes 0}$
the monoidal unit. By definition this is a strict monoidal category, so we need not
worry about the structure maps.

In addition to the identity morphism on each object, we also include morphisms
$m : M \otimes M \to M$ and $e : I \to M$ and all their composites and products, subject to
the relations

\[
m \circ (m \otimes \mathrm{id}_M) = m \circ (\mathrm{id}_M \otimes m) \tag{associativity}
\]

and

\[
m \circ (e \otimes \mathrm{id}_M) = \mathrm{id}_M = m \circ (\mathrm{id}_M \otimes e). \tag{unitality}
\]

These equations correspond respectively to the associativity and unitality laws for
the monoid.

Now given any monoid $(X, \cdot, 1)$, we can define the strong monoidal functor
$F : \mathrm{Th}(\mathrm{Mon}) \to \mathrm{Set}$ mapping $M^{\otimes n}$ to the $n$-fold cartesian product $X^n$ of the set $X$,
$m$ to the monoid multiplication function $X \times X \to X$; $(x, y) \mapsto x \cdot y$, and $e$ to
the function $\{*\} \to X$; $* \mapsto 1$ with image the monoid unit. This is well-defined as
the relations obeyed by $m$ and $e$ are precisely those required to ensure the monoid
operation $\cdot$ is associative and unital. Furthermore, taking the canonical isomorphisms
$X^k \times X^n \to X^{k+n}$ given by the universal property of products, we see that $F$ is a
strong monoidal functor.

Conversely, given any strong monoidal functor $(F, F_\otimes, F_*) : \mathrm{Th}(\mathrm{Mon}) \to \mathrm{Set}$, it
is straightforward to show, using the naturality of $F_\otimes$ and the diagrams required by the
definition of a strong monoidal functor, that the triple $(FM, Fm \circ F_{\otimes,M,M}, (Fe \circ F_*)(*))$
is a well-defined monoid. From here it also can be shown that these two constructions
are inverses up to isomorphism, and so we have a bijection

\[
\{\text{isomorphism classes of monoids}\}
\longleftrightarrow
\{\text{isomorphism classes of strong monoidal functors } \mathrm{Th}(\mathrm{Mon}) \to \mathrm{Set}\}.
\]

This shows that the strong monoidal functors from Th(Mon) to Set classify all
monoids.
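To make the correspondence concrete, the sketch below checks, for a finite carrier set, exactly the two relations imposed on the generating morphisms of Th(Mon), and builds the sets to which the functor $F$ would send the objects of Th(Mon). The function names `is_monoid` and `F_object`, and the choice of the integers modulo 4 as sample data, are assumptions made for illustration.

```python
from itertools import product


def is_monoid(X, mul, unit):
    """Check closure plus the associativity and unitality relations on a finite set X."""
    closed = unit in X and all(mul(x, y) in X for x, y in product(X, repeat=2))
    assoc = all(mul(mul(x, y), z) == mul(x, mul(y, z))
                for x, y, z in product(X, repeat=3))
    unital = all(mul(unit, x) == x == mul(x, unit) for x in X)
    return closed and assoc and unital


def F_object(X, n):
    """Image of the n-th tensor power of M: the n-fold cartesian power of X, as tuples."""
    return set(product(X, repeat=n))


Z4 = range(4)                           # hypothetical sample carrier {0, 1, 2, 3}
add_mod4 = lambda x, y: (x + y) % 4
sub_mod4 = lambda x, y: (x - y) % 4

assert is_monoid(Z4, add_mod4, 0)       # addition mod 4 satisfies both relations
assert not is_monoid(Z4, sub_mod4, 0)   # subtraction mod 4 is not associative
assert F_object(Z4, 0) == {()}          # the unit I is sent to a singleton set
assert len(F_object(Z4, 2)) == 4 ** 2   # M (x) M is sent to X x X
```

Here the empty tuple plays the role of the element $*$ of the monoidal unit of Set.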

In fact, the category Th(Mon) classifies not only monoids themselves, but also
the maps between them. Indeed, given monoids $X, X'$ and corresponding strong
monoidal functors $F, F'$, we also have a bijection

\[
\{\text{monoid homomorphisms } X \to X'\}
\longleftrightarrow
\{\text{monoidal natural transformations } F \Rightarrow F'\}.
\]

This bijection sends a monoid homomorphism $\varphi : X \to X'$ to the monoidal natural
transformation defined on $M \in \mathrm{Th}(\mathrm{Mon})$ by $\varphi : FM \to F'M$. The requirement
that monoid homomorphisms preserve the identity corresponds to the triangle (MNT1)
that monoidal natural transformations must obey, while the requirement that monoid
homomorphisms preserve the monoid multiplication corresponds to the square (MNT2).
It is further possible to show that these bijections respect composition of monoid
homomorphisms and monoidal natural transformations. This shows that the category
of monoids is equivalent to the category of strong monoidal functors from Th(Mon)
to Set. It is in this strong sense that Th(Mon) classifies monoids, and for this reason
we call this category the theory of monoids.
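The two conditions on a monoid homomorphism, matched in the text with the triangle and the square for monoidal natural transformations, can be checked directly on finite examples. The following sketch uses a hypothetical helper `is_monoid_hom` and, as sample data, reduction modulo 2 from the integers modulo 4; neither appears in the original text.

```python
from itertools import product


def is_monoid_hom(phi, X, mul_X, unit_X, mul_Y, unit_Y):
    """Check that phi : X -> Y preserves the unit and the multiplication."""
    # Preserving the unit is the condition matched with the triangle (MNT1).
    preserves_unit = phi(unit_X) == unit_Y
    # Preserving multiplication is the condition matched with the square (MNT2).
    preserves_mul = all(phi(mul_X(a, b)) == mul_Y(phi(a), phi(b))
                        for a, b in product(X, repeat=2))
    return preserves_unit and preserves_mul


Z4 = range(4)
add_mod4 = lambda x, y: (x + y) % 4
add_mod2 = lambda x, y: (x + y) % 2

reduce_mod2 = lambda x: x % 2        # a homomorphism from Z/4 to Z/2
shifted = lambda x: (x + 1) % 2      # sends the unit 0 to 1, so not a homomorphism

assert is_monoid_hom(reduce_mod2, Z4, add_mod4, 0, add_mod2, 0)
assert not is_monoid_hom(shifted, Z4, add_mod4, 0, add_mod2, 0)
```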

One advantage of this perspective is that we may now talk of monoid objects
in other monoidal categories, which are often interesting structures in their own
right. This often gives insight into the relationships between known mathematical
structures. For example, the category of monoid objects in the monoidal category
$(\mathrm{Ab}, \otimes, \mathbb{Z})$ of abelian groups with tensor product as monoidal product and $\mathbb{Z}$ as the
monoidal unit can be shown to be precisely the category of rings.

We will use this idea of defining generalised monoid-like objects in categories other
than Set in pursuing a categorical definition of a causal theory. In particular, we will
be interested in commutative comonoid objects.

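Definition 1.12 (Commutative comonoid). As in defining Th(Mon), fix a symbol
$M$, and define Th(CComon) to be the symmetric
monoidal category with objects tensor powers of $M$ and morphisms generated
by the swaps and the maps $\delta : M \to M \otimes M$ and $\epsilon : M \to I$, subject to the relations

\[
(\delta \otimes \mathrm{id}_M) \circ \delta = (\mathrm{id}_M \otimes \delta) \circ \delta, \tag{coassociativity}
\]

\[
(\epsilon \otimes \mathrm{id}_M) \circ \delta = \mathrm{id}_M = (\mathrm{id}_M \otimes \epsilon) \circ \delta, \tag{counitality}
\]

and

\[
\sigma_{M,M} \circ \delta = \delta. \tag{commutativity}
\]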



A commutative comonoid in a symmetric monoidal category $\mathcal{C}$ is a strong symmetric
monoidal functor $\mathrm{Th}(\mathrm{CComon}) \to \mathcal{C}$. Abusing our terminology slightly, we
will often just say that the image of $M$ under this functor is a commutative comonoid.
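As an illustration of Definition 1.12, in Set with the cartesian product the copy and delete maps on a set satisfy the coassociativity, counitality and commutativity relations, once the associator and unitors of Set are taken into account (Set is not strict). The Python sketch below checks this pointwise on a few sample elements; the names `copy`, `delete`, `swap` and the checker functions are our own, and tuples stand in for cartesian products with `()` as the single element of the unit.

```python
def copy(x):
    """Comultiplication X -> X x X: duplicate the input."""
    return (x, x)


def delete(x):
    """Counit X -> {*}: discard the input; () plays the role of *."""
    return ()


def swap(pair):
    """Swap X x Y -> Y x X."""
    a, b = pair
    return (b, a)


def coassociative_at(x):
    a, b = copy(x)
    (p, q), r = (copy(a), b)             # (copy x id) after copy
    return (p, (q, r)) == (a, copy(b))   # apply the associator, compare with (id x copy) after copy


def counital_at(x):
    a, b = copy(x)
    left = (delete(a), b)                # an element of {*} x X
    right = (a, delete(b))               # an element of X x {*}
    return left[1] == x == right[0]      # apply the left and right unitors


def commutative_at(x):
    return swap(copy(x)) == copy(x)


sample = ["rain", "sun", 3]              # hypothetical sample elements
assert all(coassociative_at(x) and counital_at(x) and commutative_at(x)
           for x in sample)
```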

Chapter 2

Categorical Probability Theory 



Probability theory concerns itself with random variables: properties of a system that 
may take one of a number of possible outcomes, together with a likelihood for each 
possible observation regarding the property. We call the property itself the variable, 
and together the likelihoods for the observations form a probability assignment for 
the variable. As we will mainly concern ourselves with relationships between random 
variables, we have particular interest in rules that specify a probability assignment 
on one variable given a probability assignment on another — this can be seen as the 
latter variable having some causal influence on the former. 

In this chapter we will develop the standard tools to talk about all these things, 
but with emphasis on a categorical perspective. These categorical ideas originate 
with Lawvere [13], and were extended by Giry in [9]. We caution that the terminology
we have used for the basic concepts in probability is slightly nonstandard,
but predominantly follows that of Pearl [12] and the Bayesian networks community. 
Although it will not affect the mathematics, we will implicitly take a frequentist view 
of probability to complement our physical interpretation of causality. 

2.1 The category of measurable spaces 

The idea of a variable is captured by measurable spaces. These consist of a set $X$,
thought of as the set of 'outcomes' of the variable, and a collection $\Sigma$ of subsets of
$X$ obeying certain closure properties, which represent possible observations about $X$
and which we call the measurable sets of $X$. We then talk of probability assignments
on these measurable spaces via a function $P : \Sigma \to [0, 1]$ satisfying some consistency
properties. While the collection of measurable sets is often taken to be the power set
$\mathcal{P}(X)$ when $X$ is finite, for larger sets some restrictions are usually necessary if one
wants to assign interesting collections of probabilities to the space.
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Given a measurable set A, we think of the number P{A) as the chance that 
the outcome of the random variable with outcomes represented by X will lie in the 
subset A. As an example, the process of rolling a six-sided die can be described by the 
measurable space with set of outcomes X = {1,2,3,4,5,6} and measurable subsets 
S = V{X) all subsets of X. The statement that the die is fair is then the statement 
that the probability associated to any subset A C X is P{A) = ^\A\. 

We formalise this in the following standard way; more details can be found in [3] 
or [21], or indeed any introductory text to probability theory. 

Definitions 2.1 ($\sigma$-algebra, measurable space). Given a set $X$, a $\sigma$-algebra $\Sigma$ on
$X$ is a set of subsets of $X$ that contains the empty set and is closed under both
countable union and complementation in $X$. We call a pair $(X, \Sigma)$ consisting of a set
$X$ and a $\sigma$-algebra $\Sigma$ on $X$ a measurable space.

On occasion we will just write $X$ for the measurable space $(X, \Sigma)$, leaving the
$\sigma$-algebra implicit. In these cases we will write $\Sigma_X$ to mean the $\sigma$-algebra on $X$.

Example 2.2 (Discrete and indiscrete measurable spaces). Let $X$ be a set. The power
set $\mathcal{P}(X)$ of $X$ forms a $\sigma$-algebra, and we call $(X, \mathcal{P}(X))$ a discrete measurable space.
At the other extreme, distinct whenever $X$ has more than one element, is the
$\sigma$-algebra $\{\emptyset, X\}$. In this case we call $(X, \{\emptyset, X\})$ an indiscrete measurable space.

Even beyond the two of the above example, it is not hard to find $\sigma$-algebras: we
may construct one from any collection of subsets. Indeed, we say that the $\sigma$-algebra
$\Sigma(\mathcal{G})$ generated by a collection $\mathcal{G} = \{G_i\}_{i \in I}$ of subsets of a set $X$ is the intersection
of all $\sigma$-algebras on $X$ containing $\mathcal{G}$. An explicit construction can be given by taking
all countable intersections of the sets in $\mathcal{G}$ and their complements, and then taking all
countable unions of the resulting sets. We say that a measurable space is countably
generated if there exists a countable generating set for it.
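On a finite set the generated $\sigma$-algebra can be computed directly, since countable unions reduce to finite ones. The following is a minimal sketch under that assumption; the function name `generate_sigma_algebra` and the choice of generators are ours, purely for illustration.

```python
from itertools import combinations

def generate_sigma_algebra(universe, generators):
    """Smallest family of subsets of a FINITE universe that contains the
    generators and is closed under complement and union."""
    universe = frozenset(universe)
    family = {frozenset(), universe} | {frozenset(g) for g in generators}
    changed = True
    while changed:  # iterate until a fixed point is reached
        changed = False
        current = list(family)
        for a in current:
            complement = universe - a
            if complement not in family:
                family.add(complement)
                changed = True
        for a, b in combinations(current, 2):
            union = a | b
            if union not in family:
                family.add(union)
                changed = True
    return family

# Outcomes of a six-sided die, with two overlapping generating observations.
die = {1, 2, 3, 4, 5, 6}
sigma = generate_sigma_algebra(die, [{1, 2}, {2, 3}])
print(len(sigma))  # 16: all unions of the atoms {1}, {2}, {3}, {4, 5, 6}
```

Here the outcomes 4, 5 and 6 are never separated by a generator, so no measurable set contains one of them without the others.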

Example 2.3 (Borel measurable spaces). Many frequently used examples of measurable
spaces come from topological spaces. The Borel $\sigma$-algebra $\mathcal{B}_X$ of a topological
space $X$ is the $\sigma$-algebra generated by the collection of open subsets of the space.

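Example 2.4 (Product measurable spaces). Given measurable spaces $(X, \Sigma_X)$, $(Y, \Sigma_Y)$,
we write $\Sigma_X \otimes \Sigma_Y$ for the $\sigma$-algebra on $X \times Y$ generated by the collection of subsets
$\{A \times B \subseteq X \times Y \mid A \in \Sigma_X, B \in \Sigma_Y\}$. We call this the product $\sigma$-algebra of $\Sigma_X$ and
$\Sigma_Y$, and call the resulting measurable space $(X \times Y, \Sigma_X \otimes \Sigma_Y)$ the product measurable
space of $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$. Note that as $(A \times B) \cap (A' \times B') = (A \cap A') \times (B \cap B')$
and $(A \times B)^c = (A^c \times Y) \cup (X \times B^c)$, we may write

$$\Sigma_X \otimes \Sigma_Y = \Big\{ \bigcup_{i \in I} (A_i \times B_i) \ \Big|\ A_i \in \Sigma_X,\ B_i \in \Sigma_Y \Big\}.$$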



The product measurable space is in fact a categorical product in the category of 
measurable spaces. To understand this, we first must specify the notion of morphism 
corresponding to measurable spaces. Just as continuous functions reflect the open sets 
of a topology, the important notion of map for measurable spaces is that of functions
that reflect measurable sets.

Definition 2.5 (Measurable function). A function $f : X \to Y$ between measurable
spaces $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ is called measurable if for each $A \in \Sigma_Y$, $f^{-1}(A) \in \Sigma_X$.

We write Meas for the category of measurable spaces and measurable functions. 
It is easily checked that this indeed forms a category with composition simply com- 
position of functions. 
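For finite spaces, Definition 2.5 can be checked exhaustively. The sketch below assumes $\sigma$-algebras are stored as Python sets of frozensets; the helper names and the two test functions are hypothetical and chosen only to illustrate the definition.

```python
from itertools import chain, combinations

def power_set(points):
    """All subsets of a finite set, i.e. the discrete sigma-algebra."""
    pts = list(points)
    subsets = chain.from_iterable(combinations(pts, r) for r in range(len(pts) + 1))
    return {frozenset(s) for s in subsets}

def preimage(f, domain, subset):
    return frozenset(x for x in domain if f(x) in subset)

def is_measurable(f, domain, sigma_dom, sigma_cod):
    """f is measurable iff the preimage of every measurable set is measurable."""
    return all(preimage(f, domain, b) in sigma_dom for b in sigma_cod)

# A die whose only possible observations are 'even' and 'odd'.
X = frozenset(range(1, 7))
evens, odds = frozenset({2, 4, 6}), frozenset({1, 3, 5})
sigma_parity = {frozenset(), evens, odds, X}
sigma_bits = power_set({0, 1})

parity = lambda x: x % 2          # only depends on observable information
is_low = lambda x: int(x <= 3)    # needs to tell 1, 2, 3 apart from 4, 5, 6

parity_ok = is_measurable(parity, X, sigma_parity, sigma_bits)
low_ok = is_measurable(is_low, X, sigma_parity, sigma_bits)
print(parity_ok, low_ok)  # True False
```

The second function fails because the preimage of $\{1\}$ is $\{1,2,3\}$, which is not an observation available in the parity $\sigma$-algebra.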

It is also not difficult to check that the product measurable space $(X \times Y, \Sigma_X \otimes \Sigma_Y)$
is the product of the measurable spaces $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ in this category. As the
projection maps $\pi_X : X \times Y \to X$ and $\pi_Y : X \times Y \to Y$ of the set product are
measurable maps, it is enough to show that for any measurable space $(Z, \Sigma_Z)$ and
pair of measurable functions $f : (Z, \Sigma_Z) \to (X, \Sigma_X)$ and $g : (Z, \Sigma_Z) \to (Y, \Sigma_Y)$ the
unique function $\langle f, g \rangle : Z \to X \times Y$ given by the product in Set is a measurable
function. Since for all countable collections $\{A_i \times B_i\}_{i \in I}$ of subsets of $X \times Y$ we have

$$\langle f, g \rangle^{-1}\Big(\bigcup_{i \in I} (A_i \times B_i)\Big) = \bigcup_{i \in I} \langle f, g \rangle^{-1}(A_i \times B_i) = \bigcup_{i \in I} \big(f^{-1}(A_i) \cap g^{-1}(B_i)\big),$$

this is indeed true.

Note also that any one point set $1 = \{*\}$ with its only possible $\sigma$-algebra $\{\emptyset, 1\}$
is a terminal object in Meas. We thus may immediately view Meas as a symmetric
monoidal category, with the symmetric monoidal structure given by the fact that
Meas has finite products. The swaps $\sigma_{X,Y} : X \times Y \to Y \times X;\ (x, y) \mapsto (y, x)$ of
Meas are the same as those of the symmetric monoidal category Set. We shall by
default consider Meas as a symmetric monoidal category in this way.

Similarly, we may also show that the full subcategories FinMeas and CGMeas,
with objects finite measurable spaces and countably generated measurable spaces
respectively, are also symmetric monoidal categories with monoidal product the
categorical product.

2.2 Measures and integration



The reason we deal with measurable spaces is that these form the basic structure 
required for an object to carry some idea of a probability distribution. More pre- 
cisely, we deal with measurable spaces because they can be endowed with probability 
measures. 

Definitions 2.6 (Measure, measure space). Given a measurable space $(X, \Sigma)$, a
measure $\mu$ on $(X, \Sigma)$ is a function $\mu : \Sigma \to \mathbb{R}_{\geq 0} \cup \{\infty\}$ such that:

(i) the empty set has measure $\mu(\emptyset) = 0$; and

(ii) if $\{A_i\}_{i \in I}$ is a countable collection of disjoint measurable sets then
$\mu(\bigcup_{i \in I} A_i) = \sum_{i \in I} \mu(A_i)$.

Any such triple $(X, \Sigma, \mu)$ is then known as a measure space. When $\mu(X) = 1$, we
further call $\mu$ a probability measure, and $(X, \Sigma, \mu)$ a probability space.

We will have to pay close attention to the properties of the collections, in fact
$\sigma$-ideals, of sets of measure zero of probability spaces in the following. These represent
possible observations of our random variable that nonetheless are 'never' observed,
giving us very little information about their causal consequences. Very often we will
pronounce functions equal 'almost everywhere' if they agree but for a set of measure
zero. More generally, we say a property with respect to a measure space is true almost
everywhere or for almost all values if it holds except on a set of measure zero. We
also say that a measure space is of full support if its only subset of measure zero is
the empty set $\emptyset$. Such spaces are necessarily countable measure spaces.

Example 2.7 (Finite and countable measurable spaces). We shall say that a measurable
space $(X, \Sigma)$ is a finite measurable space if $\Sigma$ is a finite set. In this case there
exists a finite generating set $\{A_1, \dots, A_n\}$ for $\Sigma$ consisting of pairwise disjoint subsets
of $X$, and measures $\mu$ on $(X, \Sigma)$ are in one-to-one correspondence with functions
$m : \{A_1, \dots, A_n\} \to \mathbb{R}_{\geq 0}$, with $\mu(A) = \sum_{A_i \subseteq A} m(A_i)$ for all measurable subsets $A$
of $X$. Measures may thus also be thought of as vectors with non-negative entries in
$\mathbb{R}^n$, with probability measures those vectors whose entries also sum to 1. We may
similarly define countable measurable spaces, and note that measures on these spaces
are in one-to-one correspondence with functions $m : \mathbb{N} \to \mathbb{R}_{\geq 0}$.

Writing $n$ for some chosen set with $n \in \mathbb{N}$ elements, note that this suggests each
finite measurable space is in some sense 'isomorphic' to $\mathbf{n} = (n, \mathcal{P}(n))$ for some $n$.
Although this is not true in Meas, we will work towards constructing a category in
which this is true.

We give two more useful examples of measures.

Example 2.8 (Borel measures, Lebesgue measure). A Borel measure is a measure
on a Borel measurable space. An important collection of examples of these are the
Lebesgue measures on $(\mathbb{R}^n, \mathcal{B}_{\mathbb{R}^n})$. These may be characterised as the unique Borel
measure on $\mathbb{R}^n$ such that the measure of each closed $n$-dimensional cube is given by
its $n$-dimensional volume. See any basic text on measure theory, such as [21, Chapter
1], for more details.

When speaking of $\mathbb{R}$ as a measure space, we will mean $\mathbb{R}$ with its Borel $\sigma$-algebra
and Lebesgue measure. In particular, when referring to a real-valued measurable
function, we shall take the codomain as having this structure.

Example 2.9 (Product measures). Given measure spaces $(X, \Sigma_X, \mu)$ and $(Y, \Sigma_Y, \nu)$, we
may define the product measure $\mu \times \nu$ on the product measurable space
$(X \times Y, \Sigma_X \otimes \Sigma_Y)$ as the unique measure on this space such that for all $A \in \Sigma_X$ and $B \in \Sigma_Y$,

$$\mu \times \nu(A \times B) = \mu(A)\,\nu(B).$$

A proof of the existence and uniqueness of such a measure can be found in pT| 
Theorem 6.1.5]. 

One way in which measures interact with measurable functions is that measures 
may be 'pushed forward' from the domain to the codomain of a measurable map. 

Definition 2.10 (Push-forward measure). Let $(X, \Sigma_X, \mu)$ be a measure space, $(Y, \Sigma_Y)$
a measurable space, and $f : X \to Y$ a measurable function. We then define the
push-forward measure $\mu_f$ of $\mu$ along $f$ to be the map $\Sigma_Y \to \mathbb{R}$ given by

$$\mu_f(B) = \mu(f^{-1}(B)).$$

Note that $\mu_f(Y) = \mu(f^{-1}(Y)) = \mu(X)$, so the push-forward of a probability measure
is again a probability measure.

As causality concerns the relationships between random variables, we shall be
particularly interested in measures on product spaces, so-called joint measures. An
important example of a push-forward measure is that of the marginals of a joint
measure. These are the push-forward measures of a joint measure along the projections of
the product space: given a joint measure space $(X \times Y, \Sigma_X \otimes \Sigma_Y, \mu)$ with projections
$\pi_X : X \times Y \to X$ and $\pi_Y : X \times Y \to Y$, we define the marginal $\mu_X$ of $\mu$ on $X$ to be the
push-forward measure of $\mu$ along $\pi_X$, and similarly for $\mu_Y$. We also say that we have
marginalised over $Y$ when constructing the marginal $\mu_X$ from the measure $\mu$. Note
that the marginals of a joint probability measure are again probability measures.
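For a finite discrete space, the push-forward and the marginals can be sketched in a few lines. We assume here that a measure is stored as a dictionary from outcomes to masses; the joint distribution below is made up for illustration and does not come from the text.

```python
from collections import defaultdict
from fractions import Fraction

def push_forward(mu, f):
    """Push-forward of a finitely supported measure mu (dict: outcome -> mass)
    along f: the mass of each point y is the total mass of f^{-1}({y})."""
    out = defaultdict(Fraction)  # Fraction() is zero
    for x, mass in mu.items():
        out[f(x)] += mass
    return dict(out)

# Illustrative joint measure on {cloudy, clear} x {rain, dry}.
joint = {
    ("cloudy", "rain"): Fraction(3, 10),
    ("cloudy", "dry"): Fraction(2, 10),
    ("clear", "rain"): Fraction(1, 10),
    ("clear", "dry"): Fraction(4, 10),
}

# Marginals are push-forwards along the two projections.
marginal_x = push_forward(joint, lambda pair: pair[0])
marginal_y = push_forward(joint, lambda pair: pair[1])
print(marginal_x)  # cloudy: 1/2, clear: 1/2
print(marginal_y)  # rain: 2/5, dry: 3/5
```

Exact rational arithmetic is used so that the total mass of each marginal is exactly 1.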

Observe that for each point $x$ in its domain, a measurable function $f : (X, \Sigma_X) \to
(Y, \Sigma_Y)$ induces a 'point measure'

$$\Sigma_Y \to [0, 1]; \qquad B \mapsto \begin{cases} 1 & \text{if } f(x) \in B; \\ 0 & \text{if } f(x) \notin B \end{cases}$$

on its codomain $(Y, \Sigma_Y)$. From this point of view, the push-forward measure of some
measure $\mu$ on $(X, \Sigma_X)$ along $f$ can be seen as taking the '$\mu$-weighted average' or
'expected value' of these induced point measures on $(Y, \Sigma_Y)$. More precisely, the
push-forward measure may be defined as the integral of these point measures with
respect to $\mu$.¹

For the sake of completeness, we quickly review the definition of the integral for
bounded real-valued measurable functions nonzero on a set of finite measure, but the
reader is referred to Ash [3, §1.5] or Stein and Shakarchi [21, Chapter 2] for full detail.

We first define the integral of simple functions. Let $(X, \Sigma, \mu)$ be a measure space,
and let $A$ be a subset of $X$. We write $\chi_A : X \to \mathbb{R}$ for the characteristic function

$$\chi_A(x) = \begin{cases} 1 & \text{if } x \in A; \\ 0 & \text{if } x \notin A, \end{cases}$$

and call a weighted sum $\varphi = \sum_{k=1}^{N} c_k \chi_{A_k}$ of characteristic functions of measurable sets
a simple function. The integral $\int_A \varphi \, d\mu$ of a simple function $\varphi$ over the measurable
set $A$ with respect to $\mu$ is defined to be

$$\int_A \varphi \, d\mu = \sum_{k=1}^{N} c_k \, \mu(A \cap A_k)$$

when this sum is finite. Note that this implies that the integral over $X$ of the
characteristic function of a measurable set $A$ is just $\mu(A)$.

Let now $f$ be a bounded real-valued measurable function such that the set
$\{x \in X \mid f(x) \neq 0\}$ is of finite measure. It can be shown there then exists a uniformly
bounded sequence $\{\varphi_n\}_{n \in \mathbb{N}}$ of simple functions supported on the support of $f$ and
converging to $f$ for almost all $x$. Using this sequence, we define the integral $\int_A f \, d\mu$
of $f$ over $A$ with respect to $\mu$ to be

$$\int_A f \, d\mu = \lim_{n \to \infty} \int_A \varphi_n \, d\mu.$$

By our assumptions, this limit always exists, is finite, and is independent of the
sequence $\{\varphi_n\}_{n \in \mathbb{N}}$. Where we do not write the domain of integration $A$, we mean
that the integral is taken over the entire domain $X$ of $f$.

¹ A complementary perspective views the integral in terms of push-forwards, but only once we
have defined the standard notion of multiplying functions with measures to produce a new measure.
Indeed, given a bounded real-valued measurable function $f$ and a measure $\mu$ on a measurable space
$(X, \Sigma)$, this new measure $f\mu$ is equal to $\int_A f \, d\mu$ on each $A \in \Sigma$, and this allows us to see the integral
$\int_A f \, d\mu$ as the value, on the set $*$, of the push-forward measure of the measure $\chi_A f \mu$ along the
unique map $X \to *$ to the terminal object.

We will not discuss the technicalities of the integral further, but instead note that 
in the case of Lebesgue measure the notion of integration agrees with that of Riemann 
integration, and for finite measure spaces it can be viewed as analogous to matrix 
multiplication — this will be explained fully in the following section. Our examples 
will be limited to these cases. 

More generally, this idea of averaging measures will play a crucial role in how we
reason about consequences of causal relationships. As an illustration, suppose that we
have measurable spaces $C$ and $R$, representing say cloud cover and rain on a given day
respectively, and for each value of cloud cover (that is, each outcome in $C$) we
are given the probability of rain. We will assume this forms a real-valued measurable
function $f$ on $C$. If we are further given a measure $\mu$ on $C$ representing how cloudy
a day is likely to be, we can 'average' over this measure to give a probability of rain
on that day. This averaging process is given by the integral of $f$ with respect to $\mu$.

Implicitly here we are talking about conditional probabilities: for each outcome
of the space $C$ we get a measure on $R$. This idea will form our main idea of map
between measurable spaces.
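When $C$ is finite and discrete the integral reduces to a weighted sum, which makes the averaging concrete. The numbers below are invented for illustration, and `integrate` is a hypothetical helper rather than anything defined in the text.

```python
from fractions import Fraction

def integrate(f, mu):
    """Integral of f (dict: outcome -> value) against a finitely supported
    measure mu (dict: outcome -> mass), i.e. the mu-weighted average of f."""
    return sum(f[x] * mass for x, mass in mu.items())

# How cloudy a day is likely to be (a probability measure on C).
cloud_cover = {
    "clear": Fraction(1, 2),
    "partly": Fraction(3, 10),
    "overcast": Fraction(1, 5),
}

# Probability of rain for each outcome of cloud cover (the function f on C).
rain_given_cloud = {
    "clear": Fraction(1, 20),
    "partly": Fraction(3, 10),
    "overcast": Fraction(4, 5),
}

p_rain = integrate(rain_given_cloud, cloud_cover)
print(p_rain)  # 11/40
```

The result is $\frac{1}{2}\cdot\frac{1}{20} + \frac{3}{10}\cdot\frac{3}{10} + \frac{1}{5}\cdot\frac{4}{5} = \frac{11}{40}$, the overall chance of rain on that day.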

2.3 The category of stochastic maps 

Measurable functions describe a deterministic relationship between two variables: 
if one fixes an outcome of the domain variable, a measurable function specifies a 
unique corresponding outcome for the codomain. When describing a more stochastic 
world, such as that given by a Markov chain, such certainty is often out of reach. In 
these cases stochastic maps — variously also called stochastic kernels, Markov kernels, 
conditional probabilities, or probabilistic mappings — may often be useful instead. 
These are more general, mapping outcomes of the domain to probability measures 
on, instead of points of, the codomain.

Definition 2.11 (Stochastic map). Let $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ be measurable spaces.
A stochastic map $k : (X, \Sigma_X) \to (Y, \Sigma_Y)$ is a function

$$k : X \times \Sigma_Y \to [0, 1]; \qquad (x, B) \mapsto k(x, B)$$

such that

(i) for each $x \in X$ the function $k_x := k(x, -) : \Sigma_Y \to [0, 1]$ is a probability measure
on $Y$; and

(ii) for each measurable set $B \subseteq Y$ the function $k_B := k(-, B) : X \to [0, 1]$ is
measurable.

The composite of stochastic maps

$$\ell \circ k : (X, \Sigma_X) \xrightarrow{\ k\ } (Y, \Sigma_Y) \xrightarrow{\ \ell\ } (Z, \Sigma_Z)$$

is defined by the integral

$$\ell \circ k(x, C) = \int_Y \ell(-, C) \, dk_x,$$

where $x \in X$ and $C \in \Sigma_Z$. That this is a well-defined stochastic map follows
immediately from the basic properties of the integral.

Note that these definitions are those suggested by our discussion at the close of the
previous section: put more succinctly, a stochastic map is a measure-valued function
(subject to a measurability requirement), and the composite $\ell \circ k(x, C)$ of stochastic
maps $\ell$ and $k$ is given by integrating the measures on the codomain $(Z, \Sigma_Z)$ with
respect to the measure $k_x$ on the intermediate variable $(Y, \Sigma_Y)$.

We give a few examples. 

Example 2.12 (Probability measures as stochastic maps). Observe that a stochastic
map $k : 1 \to (X, \Sigma)$ is simply a probability measure on $(X, \Sigma)$.

Example 2.13 (Deterministic stochastic maps). In the previous section we discussed
how a measurable function $f : (X, \Sigma_X) \to (Y, \Sigma_Y)$ induces 'point measures' on its
codomain. We can now interpret these as defining the stochastic map
$\delta_f : (X, \Sigma_X) \to (Y, \Sigma_Y)$ given by

$$\delta_f : X \times \Sigma_Y \to [0, 1]; \qquad \delta_f(x, B) = \begin{cases} 1 & \text{if } f(x) \in B; \\ 0 & \text{if } f(x) \notin B. \end{cases}$$

We call this the deterministic stochastic map induced by $f$. More generally, we call
any stochastic map taking values only in the set $\{0, 1\}$ a deterministic stochastic
map.

Observe that given measurable functions $f : (X, \Sigma_X) \to (Y, \Sigma_Y)$ and $g : (Y, \Sigma_Y) \to
(Z, \Sigma_Z)$, the composite of their induced maps is given by

$$\delta_g \circ \delta_f(x, C) = \int_Y \delta_g(-, C) \, d(\delta_f)_x = (\delta_f)_x(g^{-1}(C)) = \begin{cases} 1 & \text{if } g \circ f(x) \in C; \\ 0 & \text{if } g \circ f(x) \notin C, \end{cases}$$

where $x \in X$ and $C \in \Sigma_Z$. Thus $\delta_g \circ \delta_f = \delta_{g \circ f}$.

More generally, for a stochastic map $k$ and measurable function $f$ of the types
required for composition to be well-defined, we have $k \circ \delta_f(x, B) = k(f(x), B)$, and
$\delta_f \circ k(x, B) = k(x, f^{-1}(B))$.

Example 2.14 (Stochastic matrices). Let $X$ and $Y$ be finite measurable spaces of
cardinality $n$ and $m$ respectively. Note that if $x, x' \in X$ are such that $x$ lies in a
measurable set $A$ if and only if $x'$ does, then for any stochastic map $k : X \to Y$ the
measurability of $k_B$ for each $B \in \Sigma_Y$ implies the measures $k_x$ and $k_{x'}$ must be equal.
Thus, with reference to Example 2.7, we may assume without loss of generality $X$ and
$Y$ are discrete. Then, observing that all maps with discrete domain are measurable
and recalling that probability distributions on a finite discrete measurable space $m$
may be considered as vectors in $\mathbb{R}^m$ with non-negative entries that sum to one, we see
that stochastic maps $k : X \to Y$ may be considered as $m \times n$ matrices $K$ with
non-negative entries and columns summing to one. Indeed, the correspondence is given by
having the $yx$th entry $K_{y,x}$ of $K$ equal to $k(x, \{y\})$ for all $x \in X$ and $y \in Y$. We call
such matrices (matrices with entries in $[0, 1]$ and columns summing to 1) stochastic
matrices.

Let also $Z$ be a discrete finite measurable space, and let $\ell : Y \to Z$ be a stochastic
map, with corresponding stochastic matrix $L$. Then for all $x \in X$ and $z \in Z$, we
have

$$\ell \circ k(x, \{z\}) = \int_Y \ell(-, \{z\}) \, dk_x = \sum_{y \in Y} \ell(y, \{z\}) \, k(x, \{y\}),$$

and writing this in matrix notation then gives

$$\ell \circ k(x, \{z\}) = \sum_{y} L_{z,y} K_{y,x} = (LK)_{z,x}.$$

Thus our representation of finite stochastic maps as stochastic matrices respects
composition. This hints at an equivalence of categories.

We are now in a position to define the main category of interest: let the category
of stochastic maps, denoted Stoch, be the category with objects measurable spaces
and morphisms stochastic maps. It is straightforward to show this is a well-defined
category. In particular, the associativity of the composition rule follows directly from
the monotone convergence theorem [H Theorem 1], and for each object $(X, \Sigma)$ of
Stoch the delta function $\delta : (X, \Sigma) \to (X, \Sigma)$ defined by

$$\delta(x, A) = \begin{cases} 1 & \text{if } x \in A; \\ 0 & \text{if } x \notin A \end{cases}$$

(that is, the deterministic stochastic map induced by the identity function on $X$) is
the identity map.
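In the finite discrete setting of Example 2.14 these category axioms can be checked by direct computation. The following is a small sketch, assuming stochastic maps are stored as lists of rows with exact rational entries; the matrices `K` and `L` are arbitrary illustrative choices.

```python
from fractions import Fraction as F

def is_stochastic(K):
    """Entries lie in [0, 1] and every column sums to one."""
    cols = range(len(K[0]))
    entries_ok = all(0 <= entry <= 1 for row in K for entry in row)
    columns_ok = all(sum(row[x] for row in K) == 1 for x in cols)
    return entries_ok and columns_ok

def compose(L, K):
    """Matrix of the composite l . k, i.e. the matrix product L K."""
    inner, cols = len(K), len(K[0])
    return [[sum(row[y] * K[y][x] for y in range(inner)) for x in range(cols)]
            for row in L]

def identity(n):
    """Matrix of the delta function on a discrete space with n outcomes."""
    return [[F(1) if i == j else F(0) for j in range(n)] for i in range(n)]

# k : X -> Y with |X| = 3, |Y| = 2, so K is a 2 x 3 matrix; l : Y -> Z with |Z| = 2.
K = [[F(1, 2), F(1, 3), F(0)],
     [F(1, 2), F(2, 3), F(1)]]
L = [[F(1, 4), F(1)],
     [F(3, 4), F(0)]]

LK = compose(L, K)
print(LK)  # rows (5/8, 3/4, 1) and (3/8, 1/4, 0)
```

The composite is again a stochastic matrix, and composing with `identity(2)` on the left or `identity(3)` on the right returns `K` unchanged.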

Viewed with this new category, Example 2.13 defines a functor $\delta : \mathbf{Meas} \to \mathbf{Stoch}$.
In fact, we may further endow Stoch with a symmetric monoidal structure such that
this is a symmetric monoidal functor. For this we take the product of two objects to
be their product measurable space, and the product

$$k \otimes \ell : (X \times Z, \Sigma_X \otimes \Sigma_Z) \to (Y \times W, \Sigma_Y \otimes \Sigma_W)$$

of two stochastic maps $k : (X, \Sigma_X) \to (Y, \Sigma_Y)$ and $\ell : (Z, \Sigma_Z) \to (W, \Sigma_W)$ to be the
unique stochastic map extending

$$k \otimes \ell((x, z), B \times D) = k(x, B) \, \ell(z, D),$$

where $x \in X$, $z \in Z$, $B \in \Sigma_Y$ and $D \in \Sigma_W$. This assigns to each pair $(x, z)$ the
product measure of $k_x$ and $\ell_z$ on $Y \times W$, and indeed results in a well-defined functor
$\otimes : \mathbf{Stoch} \times \mathbf{Stoch} \to \mathbf{Stoch}$. Using as structural maps the induced deterministic
stochastic maps of the corresponding structural maps in Meas then gives Stoch the
promised symmetric monoidal structure.

Remark 2.15. Observe that any one-point measurable space $(*, \{\emptyset, *\})$ is a terminal object
in Stoch: from any other measurable space $(X, \Sigma)$ there only exists the map

$$X \times \{\emptyset, *\} \to [0, 1]; \qquad (x, \emptyset) \mapsto 0, \quad (x, *) \mapsto 1.$$

Example 2.12 thus shows that the points of an object of Stoch are precisely the
probability measures on that space.

While Stoch has a straightforward definition and interpretation, the generality of
the concept of a $\sigma$-algebra means that Stoch admits a few pathological examples that
indicate it includes more than what we want to capture. For this reason, and for the
clarity that simpler cases can bring, we will mostly work with two full subcategories of
Stoch. The first is FinStoch, the category of finite measurable spaces and stochastic
maps. Building on Example 2.14 and as promised in Example 2.7, this is monoidally
equivalent to the skeletal symmetric monoidal category SMat with objects natural
numbers and morphisms stochastic matrices. As categories of vector spaces are well
studied, this characterisation gives much insight into the structure of FinStoch.

The main disadvantage of FinStoch is that many random variables are not finite.
One category admitting infinite measure spaces, used by Giry [9], Panangaden
[15], and Doberkat [6], among others, is the category of standard Borel spaces,²
which can be skeletalised as the countable measurable spaces and the unit interval
with its Borel $\sigma$-algebra. We will favour the less frequently used but slightly more
general category CGStoch, the full subcategory of Stoch obtained by restricting the
objects to the countably generated measurable spaces. This setting is general enough
to handle almost all examples of probability spaces that arise in applications, but has
a few nice properties that Stoch does not. In the next section we see one of them:
the deterministic stochastic maps here are precisely those that arise from measurable
functions.



2.4 Deterministic stochastic maps 

Recall that the deterministic stochastic maps are those that take only the values
0 and 1. These will play a crucial role in maps between collections of causally related
random variables. The key reason for this is that these maps show much more respect
for the structure of the measurable spaces than general stochastic maps. For example,
for a stochastic map to be an isomorphism in Stoch, it must be deterministic.

Proposition 2.16. Let $k : (X, \Sigma_X) \to (Y, \Sigma_Y)$ be an isomorphism in Stoch. Then
$k$ is deterministic.

Proof. Our argument rests on the fact that if $f : X \to [0, 1]$ is a measurable function
on a probability space $(X, \Sigma, \mu)$ such that $\int_X f \, d\mu = 1$, then $\mu(f^{-1}\{1\}) = 1$.

Write $h$ for the inverse stochastic map to $k$, and fix $B \in \Sigma_Y$. We begin by
defining $A = k_B^{-1}\{1\}$, where we remind the reader that $k_B$ is the measurable function
$k(-, B) : X \to [0, 1]$. Note that we then have $B \subseteq h_A^{-1}\{1\}$, since for any $y \in B$ that
$h$ is the inverse to $k$ gives $\int_X k(-, B) \, dh_y = 1$, so by the above fact $h_y(A) = 1$, and
hence $y \in h_A^{-1}\{1\}$.

It is enough to show that for any $x \in X$, $k(x, B) = 0$ or $1$. If $x \in A$ we are
done: by definition then $k(x, B) = 1$. Suppose otherwise. Then, again as $h$ and $k$ are
inverses, $\int_Y h(-, A) \, dk_x = 0$. But

$$\int_Y h(-, A) \, dk_x \geq \int_B h(-, A) \, dk_x = \int_B 1 \, dk_x = k(x, B).$$

Thus $k(x, B) = 0$, as required. □

² A measurable space is a standard Borel space if it is the Borel measurable space of some Polish
space. A topological space is a Polish space if it is the underlying topological space of some
complete separable metric space. This category then has objects standard Borel spaces and morphisms
stochastic maps between them.

In the previous section, we showed that every measurable function induces a
deterministic stochastic map. One of the reasons that we prefer to work with countably
generated measurable spaces is that in CGStoch the converse is also true.

Proposition 2.17. Let $(Y, \Sigma_Y)$ be a measurable space with $\Sigma_Y$ countably generated.
Then a stochastic map $k : X \to Y$ is deterministic if and only if there exists a
measurable function $f : X \to Y$ with $k = \delta_f$.

Proof. A proof can be found in [SI Proposition 2.1], but we outline a version here to
demonstrate the use of the countable generating set, and point out that we assume
the axiom of choice.

We have seen that measurable functions induce deterministic maps. For the
converse, let $\mathcal{G}$ be a countable generating set for $\Sigma_Y$. Now for each $x \in X$ let

$$B_x = \bigcap_{\{B \in \mathcal{G} \,\mid\, k(x, B) = 1\}} B.$$

This is a measurable set as $\mathcal{G}$ is countable, and has $k_x$-measure 1 as its complement
may be written as a countable union of sets of $k_x$-measure zero. Choosing then for
each $x$ some $y \in B_x$, we define $f$ such that $f(x) = y$.
It is then easily checked that $k = \delta_f$, and $f$ is measurable as each $k_B$ is. □

Remark 2.18. On the other hand, one need not look too hard for a deterministic
stochastic map that is not induced by a measurable function when dealing with
non-countably generated measurable spaces.³ Indeed, take any uncountable set $X$, and
endow it with the $\sigma$-algebra generated by the points of $X$. This means that a set is
measurable if and only if it or its complement is countable. It is then easily checked
that assigning countable sets measure 0 and uncountable measurable sets measure 1
defines a measure. This gives a deterministic stochastic map from the terminal object
to $X$ not induced by any measurable function.

³ For fun, we note that if we further add the requirement that every subset of our codomain be
measurable, then we do need to look quite hard. We say that a cardinal is a measurable cardinal
if there exists a countably-additive two-valued measure on its power set such that it has measure 1
and each point has measure 0. If we are looking for such measures, then our set has to be a strongly
inaccessible cardinal [22]. These are truly huge; in some models of set theory they're too huge to
exist!
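By contrast with this pathological case, on finite discrete spaces the correspondence of Proposition 2.17 is entirely concrete: a deterministic stochastic matrix has exactly one 1 in each column, and the row holding it is the value of the inducing function. The sketch below assumes outcomes are labelled $0, \dots, n-1$; all names are hypothetical.

```python
def induced_matrix(f, n, m):
    """Matrix of delta_f for f : {0..n-1} -> {0..m-1}: entry (y, x) is 1 iff f(x) = y."""
    return [[1 if f(x) == y else 0 for x in range(n)] for y in range(m)]

def is_deterministic(K):
    """A stochastic map is deterministic when it only takes the values 0 and 1."""
    return all(entry in (0, 1) for row in K for entry in row)

def recover_function(K):
    """Read off the function inducing a deterministic stochastic matrix K."""
    if not is_deterministic(K):
        raise ValueError("matrix is not deterministic")
    # For each column x, find the unique row y carrying all of the mass.
    table = [next(y for y in range(len(K)) if K[y][x] == 1)
             for x in range(len(K[0]))]
    return lambda x: table[x]

parity = lambda x: x % 2
K = induced_matrix(parity, 4, 2)
recovered = recover_function(K)
print(K)                                  # [[1, 0, 1, 0], [0, 1, 0, 1]]
print([recovered(x) for x in range(4)])   # [0, 1, 0, 1]
```

Because the spaces here are discrete, the recovered function is unique; the next remark explains why this uniqueness fails for coarser $\sigma$-algebras.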

Remark 2.19. Note that the measurable function specifying a deterministic stochastic
map need not be unique, so we should not view the deterministic stochastic maps
as merely the collection of measurable maps lying inside CGStoch. As an example
of this, consider the one point measurable space $(*, \{\emptyset, *\})$ and any other indiscrete
measurable space $(X, \Sigma)$. Then all of the $|X|$ functions $f : * \to X$ are measurable,
and all induce the deterministic stochastic map $\delta_f(*, \emptyset) = 0$; $\delta_f(*, X) = 1$. In this
way Stoch captures the intuition that every indiscrete measurable space is the same.

In particular, non-bijective measurable endofunctions can induce the identity
stochastic map, so measurable spaces may be isomorphic in Stoch even if they are
not isomorphic in Meas. This lets FinStoch admit the skeletalisation SMat, even
while the classification of isomorphic objects in FinMeas is not nearly so neat.

More abstractly, this shows that although our symmetric monoidal functor
$\delta : \mathbf{Meas} \to \mathbf{Stoch}$ is injective on objects, it is not faithful, and so we cannot view
Meas as a subcategory of Stoch.

Although inducing deterministic stochastic maps from measurable functions is 
in general a many-to-one process, we may always take quotients of our measurable 
spaces so it becomes one-to-one. We briefly explore this idea in order to further our 
understanding of deterministic stochastic maps. 

Call two outcomes of a measurable space distinguishable if there exists a measurable
subset containing one but not the other, and indistinguishable otherwise.
Indistinguishability gives an equivalence relation on the outcomes of a measurable space. We
may take a quotient by this equivalence relation, and use the quotient map to induce
a $\sigma$-algebra on the quotient set, defining a set in the quotient to be measurable if
its preimage is. In this quotient space all outcomes are distinguishable; we call this
an empirical measurable space. The quotient map in fact induces an isomorphism in
Stoch.
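For a finite measurable space this quotient can be computed explicitly. The following sketch assumes the $\sigma$-algebra is given as a set of frozensets; the function names are ours and not part of the text.

```python
def indistinguishability_classes(points, sigma):
    """Group outcomes that lie in exactly the same measurable sets."""
    classes = {}
    for x in points:
        signature = frozenset(a for a in sigma if x in a)
        classes.setdefault(signature, set()).add(x)
    return {frozenset(c) for c in classes.values()}

def empiricise(points, sigma):
    """Quotient by indistinguishability, with the induced sigma-algebra.

    Every measurable set is a union of classes, so its image under the
    quotient map is a set of classes whose preimage is the original set."""
    classes = indistinguishability_classes(points, sigma)
    quotient = lambda x: next(c for c in classes if x in c)
    sigma_q = {frozenset(quotient(x) for x in a) for a in sigma}
    return classes, sigma_q

# A die for which only the parity of the outcome can be observed.
X = frozenset(range(1, 7))
evens, odds = frozenset({2, 4, 6}), frozenset({1, 3, 5})
sigma_parity = {frozenset(), evens, odds, X}

classes, sigma_q = empiricise(X, sigma_parity)
print(len(classes), len(sigma_q))  # 2 4
```

The six outcomes collapse to two empirical outcomes, 'even' and 'odd', and the induced $\sigma$-algebra on them is discrete.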

We call a deterministic monomorphism in Stoch an embedding of measurable
spaces. These are deterministic stochastic maps induced by injective measurable
functions on the empiricisations, and so may be thought of as maps that realise a
measurable space as isomorphic to a sub-measurable space of another. We call an
epimorphism in Stoch a coarse graining of measurable spaces. These are deterministic
stochastic maps induced by surjective measurable functions on the empiricisations,
and so may be thought of as maps that remove the ability to distinguish between
some outcomes of the domain.

The following proposition then gives a precise understanding of deterministic 
stochastic maps in CGStoch. 

Proposition 2.20. In CGStoch, every deterministic stochastic map may be factored 
as a coarse graining followed by an embedding. 

Proof. We may without loss of generality assume spaces are empirical. Then we may 
treat the deterministic stochastic maps as functions, and we know that each function 
factors into a surjection followed by an injection. □ 



2.5 Aside: the Giry monad 

To shed further light on the close relationship between Meas and Stoch, we mention
a few results that were first stated in [13], and proved in [9]. The main observation is that
Stoch forms a relation-like version of Meas. More precisely, we observe that just
as Rel is the Kleisli category for the power set monad on Set, Stoch is the Kleisli
category for the Giry monad on Meas.

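Recall that a monad on a category $\mathcal{C}$ consists of a functor $T : \mathcal{C} \to \mathcal{C}$ and natural
transformations $\eta : 1_{\mathcal{C}} \Rightarrow T$ and $\mu : T^2 \Rightarrow T$ such that for all objects $X \in \mathcal{C}$ the
associativity and unit diagrams commute, that is,

$$\mu_X \circ T(\mu_X) = \mu_X \circ \mu_{T(X)} \qquad \text{and} \qquad \mu_X \circ \eta_{T(X)} = \mathrm{id}_{T(X)} = \mu_X \circ T(\eta_X).$$

Also recall that the Kleisli category $\mathcal{C}_T$ of such a monad on $\mathcal{C}$ is the category
with objects those of $\mathcal{C}$, homsets $\hom_{\mathcal{C}}(X, TY)$ for all $X, Y \in \mathcal{C}$, and composition of
$f^* : X_T \to Y_T$, $g^* : Y_T \to Z_T$, defined by $f : X \to TY$, $g : Y \to TZ$, given by

$$g^* \circ_T f^* = (\mu \circ Tg \circ f)^*.$$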

As mentioned above, it can be checked that the functor mapping a set to its
power set can be viewed as a monad on Set, and the Kleisli category for this monad
is isomorphic to Rel. In the case of Meas and Stoch, we define the functor of the
Giry monad $\mathcal{P} : \mathbf{Meas} \to \mathbf{Meas}$ to be the functor taking a measurable space $(X, \Sigma)$
to the set of all probability measures on $(X, \Sigma)$ with the smallest $\sigma$-algebra such that
the evaluation maps

$$\mathrm{ev}_B : \mathcal{P}(X, \Sigma) \to [0, 1]; \qquad \mu \mapsto \mu(B),$$

one for each $B \in \Sigma$, are measurable.⁴ The associated natural transformations of the
monad are the transformation $1 \Rightarrow \mathcal{P}$ sending a point to its point measure, and the
transformation $\mathcal{P}^2 \Rightarrow \mathcal{P}$ sending a measure on the set of measures to its integral.
It can then be shown that this forms a well-defined
monad, with Kleisli category Stoch.
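A finite, discrete analogue of this monad makes the Kleisli composition $(\mu \circ Tg \circ f)$ tangible. In the sketch below a probability measure is a dictionary from outcomes to masses, and a measure on measures is a list of (measure, weight) pairs because dictionaries cannot be used as dictionary keys; the names `unit`, `lift`, `join` and `kleisli` are ours, and this is only an illustration, not the Giry monad itself.

```python
from fractions import Fraction as F

def unit(x):
    """Send a point to its point measure."""
    return {x: F(1)}

def lift(g, dist):
    """Apply T to g : Y -> P(Z): push dist forward along g, giving a
    measure on measures stored as (measure, weight) pairs."""
    return [(g(y), weight) for y, weight in dist.items()]

def join(weighted):
    """Send a measure on measures to its integral (the weighted average)."""
    out = {}
    for inner, weight in weighted:
        for z, p in inner.items():
            out[z] = out.get(z, F(0)) + weight * p
    return out

def kleisli(g, f):
    """Kleisli composite of f : X -> P(Y) and g : Y -> P(Z)."""
    return lambda x: join(lift(g, f(x)))

# Two illustrative stochastic maps given as lookup tables.
f_table = {"a": {"u": F(1, 2), "v": F(1, 2)}, "b": {"v": F(1)}}
g_table = {"u": {"p": F(1, 3), "q": F(2, 3)}, "v": {"q": F(1)}}
f = lambda x: f_table[x]
g = lambda y: g_table[y]

composite = kleisli(g, f)
print(composite("a"))  # p: 1/6, q: 5/6
```

The composite agrees with the sum $\sum_y \ell(y, \{z\})\,k(x, \{y\})$ of Example 2.14, and `unit` acts as the identity for this composition.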

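As Stoch is the Kleisli category for $\mathcal{P}$, $\mathcal{P}$ can be factored through Stoch, and in
fact through the functor $\delta : \mathbf{Meas} \to \mathbf{Stoch}$. This is done by defining the functor
$\varepsilon : \mathbf{Stoch} \to \mathbf{Meas}$ sending a measurable space $X$ to $\mathcal{P}X$ and a stochastic map
$k : X \to Y$ to the measurable function $\varepsilon k : \mathcal{P}X \to \mathcal{P}Y$ defined by

$$\varepsilon k : \mathcal{P}X \to \mathcal{P}Y; \qquad \mu \mapsto \Big( B \in \Sigma_Y \mapsto \int_X k(x, B) \, d\mu \Big).$$

We then have an adjunction $\delta \dashv \varepsilon$,

$$\delta : \mathbf{Meas} \rightleftarrows \mathbf{Stoch} : \varepsilon,$$

with composite $\varepsilon \circ \delta = \mathcal{P}$.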

Finally, note that if $X$ is finite or countably generated then $\mathcal{P}X$ is finite or
countably generated respectively too, so we may also view FinStoch and
CGStoch as Kleisli categories of monads on FinMeas and CGMeas respectively.

⁴ We earlier saw hints that a stochastic map may be viewed as a measure-valued measurable
function. We now see the precise meaning of this statement: a stochastic map $X \to Y$ is defined by a
measurable function $X \to \mathcal{P}Y$.



Chapter 3 
Bayesian Networks 



In the first chapter we discussed a formalism for representing processes, while in the 
second we introduced a way to think of these processes as probabilistic. In this short 
third chapter we now add to this some language for describing selected probabilistic 
processes as causal. 

As in the case of probability, although the intuition for the concept is clear, any 
attempt to make precise what is meant by causality throws up a number of philo- 
sophical questions. We shall not delve into these here, but instead say that we will 
naively view a causal relationship as an asymmetric one between two variables, in 
which the varying of one — the cause — necessarily induces variations in the other — 
the effect. In particular, we think of a causal relationship as implying a physical, 
objective, mechanism through which this occurs. 

The structure we have chosen, Bayesian networks, has roots in graphical models 
in statistics, and was first proposed as a language for causality by Pearl in [17], with 
special interest in applications to machine learning and artificial intelligence. Since 
then Bayesian networks have played a significant role in discussions of causality from 
both a computational and a philosophical perspective. This chapter in particular 
relies on expositions by Pearl and Williamson [26]. 

3.1 Conditionals and independence 

Much of the difficulty in the discussion of causality arises from the fact that causal 
relationships can never be directly observed. We instead must reconstruct such rela- 
tionships from hints in independencies between random variables. The key point is 
that if A causes B, then A and B cannot be independent. 

Definition 3.1 (Independence). Let $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ be measurable spaces, and 
let $P$ be a joint probability measure on the product space 
$(X \times Y, \Sigma_X \otimes \Sigma_Y)$. We say that $X$ and $Y$ are independent with 
respect to $P$ if $P$ is equal to the product measure of its marginals $P_X$ and $P_Y$, and 
dependent with respect to $P$ otherwise. 

Independent joint distributions can also be characterised as those that are of the 
form 

$$P(A \times B) = \int_A c(x, B)\, dP_X$$

for all $A \in \Sigma_X$ and $B \in \Sigma_Y$, where $c : X \to Y$ is a stochastic map that factors 
through the terminal object of Stoch. Indeed, to find such a $c$ corresponding to any 
independent joint distribution, we may just take the stochastic map defined by 

$$c(x, B) = P_Y(B)$$

for all $x \in X$. This general idea gives a recipe for deconstructing, or factorising, a 
joint probability measure $P$ into a marginal $P_X$ on one factor and a stochastic map 
$P_{Y|X} : X \to Y$ from that factor to the product of the others. We call this stochastic 
map a conditional for the joint measure. 

Definition 3.2 (Conditional). Let $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ be measurable spaces, and 
let $P$ be a joint probability measure on the product space 
$(X \times Y, \Sigma_X \otimes \Sigma_Y)$. Then we say that a stochastic map 
$P_{Y|X} : (X, \Sigma_X) \to (Y, \Sigma_Y)$ is a conditional for $P$ with 
respect to $X$ if for all $A \in \Sigma_X$ and $B \in \Sigma_Y$ we have 

$$P(A \times B) = \int_A P_{Y|X}(x, B)\, dP_X.$$

Note that the above integral consequently defines a measure on $X \times Y$ equal to $P$. 
Considering the marginals as stochastic maps $1 \to (X, \Sigma_X)$, this also implies that 

$$P_Y = P_{Y|X} \circ P_X.$$

This says that $P_{Y|X}$ is a stochastic map from $X$ to $Y$ that maps the marginal $P_X$ 
to the marginal $P_Y$. As there are many joint measures with marginals $P_X$ and $P_Y$, 
however, this is not a sufficient condition for $P_{Y|X}$ to be the conditional for $P$ with 
respect to $X$. 

While this is so, given a joint probability measure with marginals again probability 
measures, under mild constraints it is always true that there exists a conditional for 
it, and that this conditional is 'almost' unique. This is made precise by the following 
proposition. Recall that given a measurable space $(X, \Sigma)$, we call a measure $\mu$ on $X$ 
perfect if for any measurable function $f : X \to \mathbb{R}$ there exists a Borel measurable set 
$E \subseteq f(X)$ such that $\mu(f^{-1}(E)) = \mu(X)$. This proposition represents another reason 
why we will occasionally restrict our attention to CGStoch. 

Proposition 3.3 (Existence of regular conditionals). Let $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ be 
countably generated measurable spaces, and let $\mu$ be a measure on the product space 
such that the marginal $\mu_X$ is perfect. Then there exists a stochastic map 
$k : (X, \Sigma_X) \to (Y, \Sigma_Y)$ such that for all $A \in \Sigma_X$ and $B \in \Sigma_Y$ we have 

$$\mu(A \times B) = \int_A k(x, B)\, d\mu_X.$$

Furthermore, this stochastic map is unique in the sense that if $k'$ is another stochastic 
map with these properties, then $k$ and $k'$ are equal $\mu_X$-almost everywhere. 

Proof. Existence is proved in Faden [7, Theorem 6]. Uniqueness is not difficult to 
show, but can be found in Vakhania, Tarieladze [23, Proposition 3.2]. □ 

The existence of conditionals gives rise to a more general notion of independence, 
aptly named conditional independence. 

Definition 3.4 (Conditional independence). Let $(X, \Sigma_X)$, $(Y, \Sigma_Y)$, $(Z, \Sigma_Z)$ be 
measurable spaces, and let $P$ be a joint probability measure on the product space 
$X \times Y \times Z$. We say that $X$ and $Y$ are conditionally independent given $Z$ (with 
respect to $P$) if 

(i) a conditional $P_{XY|Z} : (Z, \Sigma_Z) \to (X \times Y, \Sigma_X \otimes \Sigma_Y)$ exists; and 

(ii) for each $z \in Z$, $X$ and $Y$ are independent with respect to the probability measure 
$P_{XY|Z}(z, -)$. 

This notion gives us far more resolution in investigations of how variables can 
depend on each other, and hence in finding causal relationships. For example, the 
variables representing the amount of rain on a given day in London and in Beijing 
are dependent (on a winter day it rains more on average than on a summer day in both 
cities), but we can tell they are not causally related because they are conditionally 
independent given the season. A key feature of Bayesian networks is that they 
allow us to translate facts about causal relationships into facts about conditional 
independence, and vice versa. 

The following lemma helps with further conceptualising conditional independence. 
In particular, conditions (iii) and (iv) say that if $X$ and $Y$ are conditionally independent 
given $Z$, then upon knowing the outcome of $Z$, the outcome of $X$ gives no 
information about the outcome of $Y$, and the outcome of $Y$ gives no information 
about the outcome of $X$. 

Lemma 3.5 (Countable conditional independence). Let $X$, $Y$, $Z$ be countable discrete 
measurable spaces with a joint probability measure $P$ such that the marginals on $Z$, 
$XZ$, $YZ$ each have full support. The following are equivalent: 

(i) $X$ and $Y$ are conditionally independent given $Z$. 

(ii) $P_{X|Z}(z, \{x\})\, P_{Y|Z}(z, \{y\}) = P_{XY|Z}(z, \{(x, y)\})$ for all $x \in X$, $y \in Y$, $z \in Z$. 

(iii) $P_{X|YZ}(y, z, \{x\}) = P_{X|Z}(z, \{x\})$ for all $x \in X$, $y \in Y$, $z \in Z$. 

(iv) $P_{Y|XZ}(x, z, \{y\}) = P_{Y|Z}(z, \{y\})$ for all $x \in X$, $y \in Y$, $z \in Z$. 

Proof. The equivalence of (i) and (ii) is just the definition of conditional independence, 
noting that in the discrete case the conditionals are uniquely determined by their 
values on individual outcomes. The equivalence of (ii), (iii), and (iv) follows from 
elementary facts in probability theory; a proof can be found in [12]. □ 
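
Condition (ii) of Lemma 3.5 is easy to test for a finite joint distribution. The Python sketch below is illustrative only: the joint table is a made-up toy version of the rainfall example (Z is the season, X and Y record whether it rains in London and in Beijing), and the function names are hypothetical. To avoid dividing by marginals, condition (ii) is checked in the equivalent cross-multiplied form P(x, y, z) P(z) = P(x, z) P(y, z), which agrees with (ii) under the full-support assumption of the lemma.

```python
from fractions import Fraction as F
from itertools import product


def marginal(joint, keep):
    """Marginalise a dict {(x, y, z): prob} onto the coordinate indices in `keep`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0) + p
    return out


def conditionally_independent(joint):
    """Condition (ii) of Lemma 3.5 for a joint on X x Y x Z, cross-multiplied."""
    pz = marginal(joint, [2])
    pxz = marginal(joint, [0, 2])
    pyz = marginal(joint, [1, 2])
    return all(
        p * pz[(z,)] == pxz[(x, z)] * pyz[(y, z)]
        for (x, y, z), p in joint.items()
    )


def independent(joint):
    """Plain independence of X and Y after forgetting Z."""
    pxy = marginal(joint, [0, 1])
    px = marginal(joint, [0])
    py = marginal(joint, [1])
    return all(p == px[(x,)] * py[(y,)] for (x, y), p in pxy.items())


# Toy numbers (an assumption): rain is likelier in winter in both cities,
# and the two cities are independent once the season is fixed.
rain_given_season = {"winter": F(3, 5), "summer": F(1, 5)}
joint = {}
for x, y, z in product([True, False], [True, False], ["winter", "summer"]):
    r = rain_given_season[z]
    joint[(x, y, z)] = F(1, 2) * (r if x else 1 - r) * (r if y else 1 - r)
```

With these numbers `conditionally_independent(joint)` holds while `independent(joint)` fails, mirroring the observation that the two rainfall variables are dependent but conditionally independent given the season.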

3.2 Bayesian networks 

We introduce Bayesian networks with an example. 

Example 3.6. Suppose that we wish to add a causal interpretation to a joint probability 
measure on the binary random variables $A = \{a, \neg a\}$, $B = \{b, \neg b\}$, and 
$C = \{c, \neg c\}$ representing the propositions that, upon being presented with a food: 

A: you like, or appreciate, the food. 

B: the food is nutritionally beneficial. 

C: you choose to eat the food. 

Let these random variables have joint probability measure given by the table 

    outcome                     probability
    $a, b, c$                   0.24
    $a, b, \neg c$              0
    $a, \neg b, c$              0.18
    $a, \neg b, \neg c$         0.18
    $\neg a, b, c$              0.06
    $\neg a, b, \neg c$         0.10
    $\neg a, \neg b, c$         0
    $\neg a, \neg b, \neg c$    0.24

Intuitively, the causal relationships between our variables are obvious: liking a food 
influences whether you choose to eat it, and so does understanding it has health 
benefits, but otherwise there are no causal relationships between the variables; liking 
a food does not cause it to be more (or less) healthy. We shall represent these causal 
relationships by the directed graph 

    A --> C <-- B

where we have drawn an arrow from one variable to another to indicate that the first 
variable has causal influence on the second. The above joint probability measure and 
graph comprise what we will later define as a Bayesian network. 

Note that we could not have chosen just any directed graph with vertices A, 
B, and C, as assertions about causal relationships have consequences that must be 
reflected in the joint probability measure. For example, as in the above graph neither 
A nor B is a cause of the other, nor do they have a common cause, we expect that A and B are 
independent with respect to the marginal $P_{AB}$. This is true. Writing probability 
measures as $n \times 1$ stochastic matrices with respect to the bases $\{x, \neg x\}$ for the binary 
variables and $\{ab, a\neg b, \neg ab, \neg a\neg b\}$ for the variable $AB$, we have 

$$P_A = \begin{pmatrix} 0.6 \\ 0.4 \end{pmatrix}, \qquad P_B = \begin{pmatrix} 0.4 \\ 0.6 \end{pmatrix},$$

and hence 

$$P_A \otimes P_B = P_{AB} = \begin{pmatrix} 0.24 \\ 0.36 \\ 0.16 \\ 0.24 \end{pmatrix}.$$

Furthermore, the above graph suggests that the probability measure on C can be 
written as a function of the outcomes of both its causes A and B. We thus expect 
that the measure has factorisation 

$$P(x_a, x_b, x_c) = \int_{(x_a, x_b)} \int_{x_c} dP_{C|AB}\, dP_{AB} = \int_{x_a} \int_{x_b} \int_{x_c} dP_{C|AB}\, dP_B\, dP_A,$$

where $x_a \in A$, $x_b \in B$, $x_c \in C$. As these variables are finite, we might also write this 
requirement as 

$$P(x_a, x_b, x_c) = P_{C|AB}(x_a, x_b, \{x_c\})\, P_B(\{x_b\})\, P_A(\{x_a\}).$$

Again, this is also true, with 

$$P_{C|AB} = \begin{pmatrix} 1 & 0.5 & 0.375 & 0 \\ 0 & 0.5 & 0.625 & 1 \end{pmatrix}.$$

Motivated by this, we will later define a compatibility requirement in terms of the 
existence of a certain factorisation. Note that although in general a probability measure 
will have many factorisations, the directed graph specifies a factorisation to which 
we attach greater, causal, significance. 
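
The numerical claims of Example 3.6 can be checked mechanically. The Python sketch below is an illustration and not part of the original development. It encodes the joint table with exact fractions, with `True` standing for $a$, $b$, $c$ and `False` for their negations, and recomputes the marginals, the independence of A and B, the conditional $P_{C|AB}$, and the factorisation. The two zero entries in the table are the values forced by the other six entries summing to 1; the helper name `prob` is hypothetical.

```python
from fractions import Fraction as F

# Joint measure of Example 3.6, keyed by (A, B, C) truth values.
# The two zero entries are forced by the remaining six summing to 1.
P = {
    (True, True, True): F(24, 100),  (True, True, False): F(0),
    (True, False, True): F(18, 100), (True, False, False): F(18, 100),
    (False, True, True): F(6, 100),  (False, True, False): F(10, 100),
    (False, False, True): F(0),      (False, False, False): F(24, 100),
}
values = [True, False]


def prob(a=None, b=None, c=None):
    """Probability of the event fixing the given coordinates (None = unconstrained)."""
    return sum(
        p for (x, y, z), p in P.items()
        if a in (None, x) and b in (None, y) and c in (None, z)
    )


# A and B are independent: P_AB is the product of the marginals P_A and P_B.
ab_independent = all(
    prob(a=a, b=b) == prob(a=a) * prob(b=b) for a in values for b in values
)

# Conditional P_{C|AB}, stored as the probability of c for each (a, b).
P_c_given_ab = {
    (a, b): prob(a=a, b=b, c=True) / prob(a=a, b=b) for a in values for b in values
}

# The causal factorisation P(a, b, c) = P_{C|AB}(a, b, {c}) P_B({b}) P_A({a}).
factorises = all(
    P[(a, b, c)]
    == (P_c_given_ab[(a, b)] if c else 1 - P_c_given_ab[(a, b)]) * prob(b=b) * prob(a=a)
    for a in values for b in values for c in values
)
```

Both `ab_independent` and `factorises` evaluate to `True`, and the values of `P_c_given_ab` are 1, 1/2, 3/8 and 0 in the basis order $ab$, $a\neg b$, $\neg ab$, $\neg a\neg b$.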

An advantage of expressing the causal relationships as a graph is that we may read 
from it other, acausal, dependencies. For example, while A and B are independent, 
the above graph suggests that if we know something about their common consequence 
C, this should induce some dependence between them. Indeed we find this does occur. 
Observe that 

$$P_{AB|C}(c, -) = \begin{pmatrix} 0.5 \\ 0.375 \\ 0.125 \\ 0 \end{pmatrix},$$

which indicates that of the foods you choose to eat, the foods you like are more likely 
to be unhealthy than those you dislike. 

More than this, however, marking certain relationships as causal affects our 
understanding of how a joint probability measure should be interpreted; we will see an 
example of this in the next chapter. 

To make these ideas precise we introduce some definitions. Recall that a directed 
graph $G = (V, A, s, t)$ consists of a finite set $V$ of vertices, a finite set $A$ of arrows, and 
source and target maps $s, t : A \to V$ such that no two arrows have the same source 
and target; precisely, such that for all distinct $a, a' \in A$ either $s(a) \neq s(a')$ or 
$t(a) \neq t(a')$. An arrow $a \in A$ is said to be an arrow from $u$ to $v$ if $s(a) = u$ and 
$t(a) = v$, while a sequence of vertices $v_1, \dots, v_k \in V$ is said to form a path from 
$v_1$ to $v_k$ if for all $i = 1, \dots, k-1$ there exists an arrow $a_i \in A$ from $v_i$ to 
$v_{i+1}$. A path is also called a cycle if in addition $v_1 = v_k$. A directed graph is 
acyclic if it contains no cycles. 

As demonstrated in the above example, directed acyclic graphs provide a depiction 
of causal relationships between variables; the direction represents the asymmetry of 
the causal relationship, while cycles are disallowed as variables cannot have causal 
influence on themselves. When we think of the set of vertices of a directed acyclic 
graph as the set of random variables of a system, we will also call the graph a causal 
structure. 

Given a directed graph, we use the terminology of kinship to talk of the relationships 
between vertices, saying that a vertex $u$ is a parent of a vertex $v$ if there is an 
arrow from $u$ to $v$, $u$ is an ancestor of $v$ if there is a path from $u$ to $v$, $u$ is a child of 
$v$ if there is an arrow from $v$ to $u$, and $u$ is a descendant of $v$ if there is a path from $v$ 
to $u$. We will in particular talk of the parents of a vertex frequently, and so introduce 
the notation 

$$\mathrm{pa}(v) = \{u \in V \mid \text{there exists } a \in A \text{ such that } s(a) = u,\ t(a) = v\}$$

for the set of parents of a vertex. When dealing with graphs as causal structures, 
we will also use the names direct causes, causes, direct effects, and effects to mean 
parents, ancestors, children, and descendants respectively. 

We say that an ordering $\{v_1, \dots, v_n\}$ of the set $V$ is an ancestral ordering if $v_i$ is 
an ancestor of $v_j$ only when $i < j$. 
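
The kinship terminology translates directly into code. The following Python sketch is a minimal illustration under two assumptions of ours: a graph is stored as a set of (source, target) pairs, so parallel arrows are impossible by construction, and a path is read as containing at least one arrow, so that a vertex is not counted as its own ancestor. All function names are hypothetical.

```python
# The causal structure of Example 3.6, as a set of arrows (source, target).
vertices = {"A", "B", "C"}
arrows = {("A", "C"), ("B", "C")}


def parents(v, arrows):
    """pa(v): all u with an arrow from u to v."""
    return {s for (s, t) in arrows if t == v}


def ancestors(v, arrows):
    """All vertices with a path of at least one arrow to v."""
    found, frontier = set(), {v}
    while frontier:
        frontier = {p for u in frontier for p in parents(u, arrows)} - found
        found |= frontier
    return found


def is_ancestral(order, arrows):
    """True if order[i] is an ancestor of order[j] only when i < j."""
    position = {v: i for i, v in enumerate(order)}
    return all(
        position[u] < position[v] for v in order for u in ancestors(v, arrows)
    )


def ancestral_ordering(vertices, arrows):
    """One ancestral ordering of an acyclic graph, built by repeatedly
    placing the vertices all of whose parents have already been placed."""
    order, remaining = [], set(vertices)
    while remaining:
        ready = sorted(v for v in remaining if parents(v, arrows) <= set(order))
        if not ready:
            raise ValueError("graph contains a cycle")
        order.extend(ready)
        remaining -= set(ready)
    return order
```

For the graph above, `ancestral_ordering(vertices, arrows)` places A and B before C. Ancestral orderings are in general not unique: B, A, C is equally valid here.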

Definition 3.7 (Bayesian network). Let $G = (V, A, s, t)$ be a directed acyclic graph, 
for each $v \in V$ let $X_v$ be a measurable space, and let $P$ be a joint probability measure 
on $\prod_{v \in V} X_v$. We say that the causal structure $G$ and the joint probability measure $P$ 
are compatible if there exists an ancestral ordering of the elements of $V$ such that there 
exist conditionals such that 

$$P(A_1 \times A_2 \times \dots \times A_n) = \int_{A_1} \int_{A_2} \cdots \int_{A_n} dP_{X_n|\mathrm{pa}(X_n)} \cdots dP_{X_2|\mathrm{pa}(X_2)}\, dP_{X_1}.$$

A Bayesian network $(G, P)$ is a pair consisting of a compatible joint probability 
measure and causal structure. 

A better understanding of this compatibility requirement can be gained from 
examining the following theorem. 

Theorem 3.8 (Equivalent compatibility conditions). Let $G$ be a causal structure, let 
$\{X_v \mid v \in V\}$ be a collection of finite measurable spaces indexed by the vertices of 
$G$, and let $P$ be a joint probability measure on their product space. Then the following 
are equivalent: 

(i) $P$ is compatible with $G$. 

(ii) $P(x_1, \dots, x_n) = \prod_{i=1}^{n} P_{X_i|\mathrm{pa}(X_i)}(\mathrm{pa}(x_i), \{x_i\})$, where 
$\mathrm{pa}(x_i)$ is the tuple consisting of the $x_j$ such that $X_j \in \mathrm{pa}(X_i)$. 

(iii) $P$ obeys the ordered Markov condition with respect to $G$: given any ancestral 
ordering of the variables, each variable is independent of its remaining preceding 
variables conditional on its parents. 

(iv) $P$ obeys the parental Markov condition with respect to $G$: each variable is 
independent of its nondescendants conditional on its parents. 

Proof. This follows from Corollaries 3 and 4 to Theorem 3.9 in Pearl [18]. □ 
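
Condition (ii) gives a direct computational test of compatibility for finite variables. The Python sketch below is one possible implementation, offered as an illustration only. It assumes the joint measure is a dictionary listing every outcome tuple (including those of probability zero), and that the graph is given as a hypothetical `parents` dictionary from variable names to lists of parent names. Each conditional is computed as a ratio of marginals; where a parent configuration has probability zero the conditional is unconstrained, and the outcome itself then necessarily has probability zero.

```python
from fractions import Fraction as F
from itertools import product


def marginal(joint, names, keep):
    """Marginal of `joint` (dict: outcome tuple -> probability) on the variables `keep`."""
    idx = [names.index(n) for n in keep]
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0) + p
    return out


def is_compatible(joint, names, parents):
    """Condition (ii) of Theorem 3.8: P(x_1..x_n) = prod_i P(x_i | pa(x_i))."""
    for outcome, p in joint.items():
        value = dict(zip(names, outcome))
        rhs = F(1)
        for v in names:
            pa = parents[v]
            den = marginal(joint, names, pa)[tuple(value[u] for u in pa)]
            num = marginal(joint, names, pa + [v])[tuple(value[u] for u in pa + [v])]
            if den == 0:
                # Parent configuration has probability zero, so p must be zero too.
                rhs = F(0)
                break
            rhs *= num / den
        if rhs != p:
            return False
    return True


# The joint measure of Example 3.6, in hundredths, in the order of its table.
names = ["A", "B", "C"]
table = [24, 0, 18, 18, 6, 10, 0, 24]
joint = {o: F(n, 100) for o, n in zip(product([True, False], repeat=3), table)}

collider = {"A": [], "B": [], "C": ["A", "B"]}   # A --> C <-- B
no_arrows = {"A": [], "B": [], "C": []}          # no causal relationships at all
```

The measure of Example 3.6 is compatible with `collider` but not with `no_arrows`, since C is not independent of A and B.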

As an illustration of the relevance of causal structure, we note that conditional 
independence relations between variables of a Bayesian network can be read from 
the causal structure using a straightforward criterion. Call a sequence of vertices 
$v_1, \dots, v_k \in V$ an undirected path from $v_1$ to $v_k$ if for all $i = 1, \dots, k-1$ there exists 
an arrow $a_i \in A$ from $v_i$ to $v_{i+1}$ or from $v_{i+1}$ to $v_i$. An undirected path $v_1, v_2, v_3$ of 
three vertices then takes the form of a 

(i) chain: $v_1 \to v_2 \to v_3$ or $v_1 \leftarrow v_2 \leftarrow v_3$; 

(ii) fork: $v_1 \leftarrow v_2 \to v_3$; or 

(iii) collider: $v_1 \to v_2 \leftarrow v_3$. 

An undirected path from a vertex $u$ to a vertex $v$ is said to be d-separated by a set of 
nodes $S$ if either the path contains a chain or a fork such that the centre vertex is in 
$S$, or if the path contains a collider such that neither the centre vertex nor any 
of its descendants are in $S$. A set $S$ then d-separates a set $U$ from a set $T$ if every 
undirected path from a vertex in $U$ to a vertex in $T$ is d-separated by $S$. 
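
The definition above can be turned into a brute-force check on small graphs. The Python sketch below is illustrative rather than efficient. It assumes a graph is a set of (source, target) pairs and, as is standard for d-separation, only enumerates undirected paths that visit no vertex twice; the function names are ours.

```python
def descendants(v, arrows):
    """All vertices reachable from v along at least one arrow."""
    found, frontier = set(), {v}
    while frontier:
        frontier = {t for (s, t) in arrows if s in frontier} - found
        found |= frontier
    return found


def undirected_paths(u, v, arrows):
    """All undirected paths from u to v that visit no vertex twice."""
    neighbours = {}
    for s, t in arrows:
        neighbours.setdefault(s, set()).add(t)
        neighbours.setdefault(t, set()).add(s)

    def extend(path):
        if path[-1] == v:
            yield path
            return
        for w in neighbours.get(path[-1], ()):
            if w not in path:
                yield from extend(path + [w])

    yield from extend([u])


def path_blocked(path, arrows, S):
    """True if some consecutive triple on the path d-separates it given S."""
    for a, b, c in zip(path, path[1:], path[2:]):
        if (a, b) in arrows and (c, b) in arrows:      # collider at b
            if b not in S and not (descendants(b, arrows) & S):
                return True
        elif b in S:                                   # chain or fork at b
            return True
    return False


def d_separates(S, U, T, arrows):
    """True if every undirected path from U to T is d-separated by S."""
    S = set(S)
    return all(
        path_blocked(p, arrows, S)
        for u in U for t in T
        for p in undirected_paths(u, t, arrows)
    )


collider = {("A", "C"), ("B", "C")}   # the causal structure of Example 3.6
chain = {("A", "B"), ("B", "C")}
```

In `collider` the empty set d-separates A from B, while the set containing C does not; in `chain` the situation is reversed, with B separating A from C.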

The main idea of this definition is that causal influence possibly creates depen- 
dence between two random variables if one variable is a cause of the other, the vari- 
ables have a common cause, or the variables have a common consequence and the 
outcome of this consequence is known. In this last case, knowledge of the conse- 
quence 'unseparates' the two variables along the path through the known common 
consequence. On the other hand, any information gained through having a common 
cause is rendered moot if we have knowledge about a variable through which the 
causal influence is mediated. These ideas are captured by the following theorem. 

Theorem 3.9. Let $(G, P)$ be a Bayesian network. If sets $U$ and $T$ of vertices of $G$ 
are d-separated by a third set $S$, then with respect to $P$, the product random variables 
$\prod_{u \in U} X_u$ and $\prod_{t \in T} X_t$ are conditionally independent given 
$\prod_{s \in S} X_s$. 

Proof. See Verma and Pearl [21]. □ 

An example of this is the way the variables A and B are independent in Example 
3.6, but dependent conditional on C. 



Chapter 4 
Causal Theories 



We now tie the elements of the last three chapters together to propose and develop a 
novel algebraic structure: a causal theory. After introducing these structures, we dis- 
cuss their models in various categories and how such models might be interpreted, and 
then look at a possibly confusing situation that causal theories and their associated 
graphical language help make lucid. 

4.1 The category associated to a causal structure 

We wish to fashion a category that captures methods of reasoning with causal rela- 
tionships. In this category, we will want our objects to represent the variables of a 
situation, while the morphisms should represent the ways one can deduce knowledge 
about one variable from another. Furthermore, as we will want to deal with more 
than one variable at a time, and the outcomes their joint variable may take, this 
category will be monoidal. 

As we may only reason about causal relationships once we have some causal 
relationships to reason with, we start by fixing a set of symbols for our variables and the 
causal relationships between them. Let $G = (V, A, s, t)$ be a directed acyclic graph. 
From this we construct a strict symmetric monoidal category $\mathcal{C}_G$ in the following way. 

For the objects of $\mathcal{C}_G$ we take the set $\mathbb{N}^V$ of functions from $V$ to the natural 
numbers. These may be considered collections of elements of the set of variables $V$, 
allowing multiplicities, and we shall often just write these as strings $w$ of elements 
of $V$. Here the order of the symbols in the string is irrelevant, and we write $\emptyset$ for the 
empty string, which corresponds to the zero map of $\mathbb{N}^V$. We view these objects as the 
variables of the causal theory, and we further call the objects which are collections 
consisting of just one instance of a single element of $V$ the atomic variables of the 
causal theory. 

There are two distinct classes of generating morphisms for $\mathcal{C}_G$. The first class is 
the collection of comonoid maps: for each atomic variable $v \in V$, we include a 
comultiplication $\Delta_v : v \to vv$ and a counit $\top_v : v \to \emptyset$. These represent the 
ideas of duplicating some information about $v$, or forgetting some. The maps of the second 
class are called the causal mechanisms. These consist of, for each atomic variable $v \in V$, 
a morphism $[v|\mathrm{pa}(v)] : \mathrm{pa}(v) \to v$, where $\mathrm{pa}(v)$ is the string consisting of the 
parents of $v$ in any order, and represent the ways we may use information about a collection 
of variables to infer facts about another. We then use these morphisms as generators for a strict 
symmetric monoidal category, taking all products and well-defined compositions, 
subject only to the constraint that for each $v \in V$ the pair $(\Delta_v, \top_v)$ forms a comonoid. 
As the swaps are identity maps, these comonoids are immediately commutative. 

We call this category $\mathcal{C}_G$ the causal theory of the causal structure $G$. Morphisms 
of $\mathcal{C}_G$ represent ways to reason about the outcome of the codomain variable given 
some knowledge about the domain variable. 

As the causal mechanisms are labelled with their domain and codomain, there is 
usually no need to label the strings when representing morphisms of $\mathcal{C}_G$ with string 
diagrams. We also often do not differentiate between the comonoid maps with labels, 
as the context makes clear which comonoid map we are applying. The order in which 
we write the string representing the set $\mathrm{pa}(v)$ corresponds to the order of the input 
strings. 

Example 4.1. The causal theory of the causal structure 

    A --> C <-- B

of Example 3.6 is the symmetric monoidal category with objects collections of the 
letters A, B, and C, and morphisms generated by counit and comultiplication maps 
on each of A, B, and C, as well as causal mechanisms $[A] : \emptyset \to A$, $[B] : \emptyset \to B$, 
and $[C|AB] : AB \to C$. We depict these causal mechanisms respectively as 

[Figure omitted: string diagram boxes labelled A, B, and C|AB.] 

We now list a few facts to give a basic understanding of the morphisms in these 
categories. These morphisms represent predictions of the consequences of the domain 
variable on the codomain variable. As causal structures are acyclic, giving rise to 
a 'causal direction', or (noncanonical) ordering on the variables, causal theories 
similarly have such a direction, and this puts limits on the structure. Indeed, one 
consequence is that a morphism can only go from an effect to a cause if it factors 
through the monoidal unit; this represents 'forgetting' the outcome of the effect, and 
reasoning about the outcomes of the cause from other, background, information. We 
say a map is inferential if it does not factor through the monoidal unit. 

Proposition 4.2. Let $\mathcal{C}_G$ be a causal theory, and let $v, v'$ be atomic variables in 
$\mathcal{C}_G$. If there exists an inferential map $v \to v'$, then $v$ is an ancestor of $v'$ in $G$. 

Proof. We reason via string diagrams to prove the contrapositive. 

Observe that a generating map is inferential if and only if, in its string diagram 
representation, the domain is topologically connected to the codomain, and that this 
property is preserved by the counitality relation the comonoid maps must obey. Thus 
it is also true in general: a morphism in $\mathcal{C}_G$ is inferential if and only if, in all string 
diagram representations, the domain is topologically connected to the codomain. 

Note also that for all generating maps with string diagrams in which the domain 
and codomain are connected, the domain and codomain are nonempty and each element 
of the domain is either equal to or an ancestor of each element of the codomain. 
This property is also preserved by the counitality relation. Thus, if $v$ is not an 
ancestor of $v'$, in all string diagram representations of a map $v \to v'$ the domain is not 
topologically connected to the codomain. Taking any such string diagram and 
continuously deforming it by moving all parts of the component connected to the domain 
below all parts of the component connected to the codomain, we thus see that the 
map may be rewritten as one that factors through the monoidal unit. □ 

In fact the converse also holds: if $v$ is an ancestor of $v'$, then there always exists 
an inferential map $v \to v'$. Indeed, if $w, w'$ are objects in $\mathcal{C}_G$ containing each atomic 
variable no more than once (that is, when $w, w' \subseteq V$), we can construct a map 
$w \to w'$ in the following manner. 

1. Take the smallest subgraph $G_{w \to w'}$ of $G$ containing the vertices of $w$ and $w'$, 
and all paths in $G$ that terminate at an element of $w'$ and do not pass through 
$w$.^2 

2. For each vertex $v \in V_{w \to w'}$ let $k_v$ be the number of arrows of $A_{w \to w'}$ with source 
$v$. Then: 

(i) For each $v \in w$ take the string diagram representing the composition of 
$k_v - 1$ comultiplications on $v$ or, when $k_v = 0$, the counit of $v$. 

(ii) For each $v \in w'$ take the string diagram for $[v|\mathrm{pa}(v)]$ composed with a 
sequence of $k_v$ comultiplications on $v$. 

(iii) For each $v \in V_{w \to w'} \setminus (w \cup w')$ take the string diagram for $[v|\mathrm{pa}(v)]$, 
composed with either a sequence of $k_v - 1$ comultiplications on $v$ or, if $k_v = 0$, 
composed with the counit on $v$. 

3. From this collection of string diagrams, create a single diagram by connecting 
an output of one string diagram to an input of another if the set $A_{w \to w'}$ contains 
an arrow from the indexing vertex of the first diagram to the indexing vertex 
of the second. 

Due to the symmetry of the monoidal category $\mathcal{C}_G$ and the associativity of the 
comultiplication maps, this process uniquely defines a string diagram representing a 
morphism from $w$ to $w'$. Moreover, this map is inferential whenever there exists some 
$v \in w$ that is an ancestor of some $v' \in w'$. These maps are in a certain sense the 
uniquely most efficient ways of predicting probabilities on $w'$ using information about 
$w$, and will play a special role in what follows. For short, we will call these maps 
causal conditionals and write them $[w' \| w]$, or simply $[w']$ when $w = \emptyset$. In this 
last case, we will also call the map $[w']$ the prior on $w'$. 

^2 Precisely, this means we take the subgraph with set of arrows $A_{w \to w'} \subseteq A$ consisting 
of all $a \in A$ for which there exist $a_1, \dots, a_n$ with 

(i) $s(a_1) = t(a)$, 

(ii) $s(a_{i+1}) = t(a_i)$ for $i = 1, \dots, n-1$, 

(iii) $t(a_n) \in w'$, 

(iv) $s(a_i) \notin w$ for $i = 1, \dots, n$, 

and vertices 

$$V_{w \to w'} = w \cup w' \cup \{v \in V \mid v = s(a) \text{ or } t(a) \text{ for some } a \in A_{w \to w'}\}.$$

Note that for each $v \in V_{w \to w'} \setminus w$, the set of parents of $v$ in this subgraph is equal to the set of 
parents of $v$ in the whole graph $G$. 

Example 4.3. This construction is a little abstruse on reading, but the main idea is 
simple and an example should make it much clearer. Let $G$ be the causal structure 

[Figure omitted: a directed acyclic graph on the vertices A, B, C, D, and E.] 


and suppose that we wish to compute the causal conditional $[DE \| B]$. Step 1 gives 
the subgraph $G_{B \to DE}$ 

[Figure omitted: the subgraph $G_{B \to DE}$.] 

consisting of all paths to D or E not passing through B. 

Step 2 then states that the causal conditional comprises the maps 

[Figure omitted: string diagrams of the component maps, with box labels including id and E|DC.] 

and we then compose these mimicking the topology of the graph $G_{B \to DE}$ to give the 
map 

$[DE \| B] =$ [Figure omitted: the composite string diagram.] 



4.2 Interpretations of causal theories 



Causal theories express abstractly avenues of causal reasoning, but this serves no 
purpose in describing specific causal relationships until we attach meanings, or an 
interpretation, to the objects and morphisms of the theory. The strength of separating 
out the syntax of reasoning is that these interpretations may now come from any 
symmetric monoidal category. Stated formally, let $\mathcal{C}$ be a causal theory, and let 
$\mathcal{D}$ be any symmetric monoidal category. Then a model of $\mathcal{C}$ in $\mathcal{D}$, or just a 
causal model, is a strong monoidal functor $M : \mathcal{C} \to \mathcal{D}$. 

We explore the basic properties of causal models in a few categories. To 
demonstrate the basic ideas, we first take a brief look at models in Set and Rel; models 
in these categories will be useful for describing deterministic and possibilistic causal 
relationships respectively. While Meas is another obvious candidate setting for 
examining causal models, we merely note that causal models here behave somewhat 
similarly to Set and move on to Stoch, the main category of interest. Here causal 
models generalise Bayesian networks. As Bayesian networks are known to provide 
a useful tool for the discussion of causality, this lends support to the idea that the 
richer structure of causal models in Stoch does too. 

Models in Set 

Due to its familiarity, we begin our discussion of causal models with an examination 
of the forms they take in Set. In both Set and Rel the objects are sets. For the 
purposes of causal models, it is useful to view these sets as variables, with the elements 
the possible outcomes of the variable. With this interpretation, we can understand 
Set as a subcategory of Meas in which every measurable space is discrete, making 
it possible to measure any subset of the outcomes of each variable. Morphisms in 
Set — that is, set functions — then assign a single outcome of the codomain variable 
to each outcome of the domain variable, and so can be said to describe deterministic 
causal relationships. 

Given a causal theory $\mathcal{C}$, a model of $\mathcal{C}$ in Set by definition consists of a strong 
monoidal functor $M : \mathcal{C} \to \mathbf{Set}$. To specify such a functor up to isomorphism, it is 
enough to specify the image of each atomic variable and each generating map, subject 
to the constraints that the generating maps chosen are well-typed with respect to the 
chosen images of the atomic variables, and that the images of the comultiplication and 
counit obey the laws of a commutative comonoid. Indeed, once these are specified, 
the values of the functor on the remaining objects and morphisms of $\mathcal{C}$ are, up to 
isomorphism, determined by the definition of a strong monoidal functor. Note also 
that as long as the aforementioned constraints are fulfilled we have a well-defined 
strong monoidal functor. 

We first observe, as we will also in the case of Stoch, that each object of Set 
has a unique comonoid structure, and this comonoid is commutative. To wit, for 
each set $X$, there is a unique map $X \to \{*\}$. Taking the product of this map and 
the identity map $X \to X$ gives the projection maps $X \times X \to X$, and the only 
function $X \to X \times X$ that composes to the identity with the projection map on 
each factor is the diagonal map $x \mapsto (x, x)$. Moreover, choosing the diagonal map 
as a comultiplication indeed gives a commutative comonoid with the unique map 
$X \to \{*\}$ as counit. It is a consequence of this that we need not worry about the 
comonoid maps; choosing a set for each variable also chooses the comonoid maps for us. 

On the other hand, the causal mechanisms need not obey any equations, so 
having defined a map on the objects, any choice of functions from the product set 
of all the direct causes of each variable to the variable itself then gives a model of 
the causal theory in Set. Each such function returns the outcome of its codomain 
variable given a configuration of the outcomes of its causes. In this sense a model of a 
causal theory in Set specifies how causes affect their consequences in a deterministic 
way. 

As maps from the monoidal unit in Set are just pointings of the target set, 
the priors $M[w]$ of a causal model are just a choice of an outcome for each of the 
atomic variables $v$ in $w$. In the case of an atomic variable with no causes, the prior 
$M[v] : \{*\} \to Mv$ is simply the causal mechanism $M[v|\mathrm{pa}(v)]$, and just picks an 
element of the set $Mv$. One might interpret this as the 'default' state of the variable 
$v$, and subsequently interpret the prior $M[V]$ on the set of all variables $V$ as the 
default state of all variables in the system. 

We shall see this as a general feature of models of causal theories; the priors specify 
what can be known about the system in some default state, while more generally the 
morphisms describe the causal relationships between variables even when not in this 
state. 
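
A deterministic model is small enough to write out in full. The Python sketch below is a hypothetical model of the causal structure A --> C <-- B of Example 3.6; the outcome names and the particular functions are invented for illustration. Each causal mechanism is interpreted as an ordinary function from the outcomes of the direct causes of a variable to an outcome of that variable, and the prior on all variables is obtained by evaluating the mechanisms in an ancestral order.

```python
# Each entry maps a variable to (its parents, the function interpreting its
# causal mechanism). The outcomes and functions are illustrative assumptions.
mechanisms = {
    "A": ((), lambda: "likes"),
    "B": ((), lambda: "unhealthy"),
    "C": (("A", "B"),
          lambda a, b: "eats" if a == "likes" or b == "healthy" else "declines"),
}


def prior(mechanisms):
    """The default outcome of every variable: evaluate each mechanism as soon
    as all of its parents have been assigned an outcome."""
    state = {}
    while len(state) < len(mechanisms):
        progressed = False
        for v, (pa, f) in mechanisms.items():
            if v not in state and all(u in state for u in pa):
                state[v] = f(*(state[u] for u in pa))
                progressed = True
        if not progressed:
            raise ValueError("mechanisms are cyclic")
    return state


def outcome_given(mechanisms, v, **known):
    """Evaluate the mechanism of v on outcomes supplied for its direct causes."""
    pa, f = mechanisms[v]
    return f(*(known[u] for u in pa))
```

Here `prior(mechanisms)` is the default state of the system, while `outcome_given(mechanisms, "C", A="dislikes", B="healthy")` evaluates the same mechanism away from that default state.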

Models in Rel 

In the category Rel, we interpret a relation $r : X \to Y$ to mean that if $X$ has the 
outcome $x \in X$, then $Y$ may only take the outcomes $y$ related to $x$ via $r$. This is a 
possibilistic notion of causality, in which the outcomes of the causes do not determine 
a single outcome of the effect variable as in Set, but only put some constraint on the 
possible outcomes of the effect variable. 

A curious property of the category Rel is that any relation $X \to Y$ may also 
be viewed as a relation $Y \to X$ in a natural way; that is, Rel is equipped with 
a contravariant endofunctor that squares to the identity, or a dagger functor. This 
means that we have a way of reversing the direction of any morphism we choose, and in 
this sense Rel itself is acausal. This makes causal models all the more useful when 
working in Rel, as they provide a way of privileging certain relations with a causal 
direction. 

For any object in Rel, we may view the functions forming the unique comonoid 
on this set in Set as relations, and hence have a commutative comonoid in Rel. An 
interesting collection of causal models $M : \mathcal{C} \to \mathbf{Rel}$ in Rel are those in which all 
objects are given these comonoid structures. Note that a map $s : \{*\} \to X$ from the 
monoidal unit to an object $X$ in Rel is simply a subset $S$ of $X$ and, assuming $X$ has 
the comonoid structure in which the comonoid maps are functions, for any relation 
$r : X \to Y$ the composite 

$$\{*\} \xrightarrow{\ s\ } X \xrightarrow{\ \Delta\ } X \times X \xrightarrow{\ \mathrm{id}_X \times r\ } X \times Y$$

of $s$ with the comultiplication $\Delta$, followed by $r$ on the second copy, is then equal to the set 

$$\{(x, y) \in X \times Y \mid x \in S,\ y \sim_r x\}.$$

Thus when the comonoid maps of the model are those of Rel inherited from Set, it 
is easy to see that the priors $[w]$ in Rel are given by the subset of the product set of 
the atomic variables in $w$ consisting of all joint outcomes that are possible given the 
constraints of the causal mechanisms. 

In this setting, however, commutative comonoids are more general; for example, 
any collection of abelian groups forms a commutative comonoid on the union of the 
sets of elements of these groups. We leave examination of causal structures for 
these other comonoid structures, and their interpretations, for later work. 

Models in Stoch 

We begin our discussion of causal models in Stoch by showing that for Stoch too we 
need not worry about selecting comonoid maps; the deterministic comonoid structure 
Stoch inherits from Set is the only comonoid structure on each object. 

Lemma 4.4. Each object of Stoch has a unique comonoid structure. Moreover, this 
comonoid structure is commutative. 



Proof. Fix an object $(X, \Sigma)$ in Stoch. We first show the existence of a comonoid
structure on X by showing that the stochastic maps

\[ V : (X, \Sigma) \to (X \times X, \Sigma \otimes \Sigma), \]

defined by

\[ V : X \times (\Sigma \otimes \Sigma) \to [0, 1]; \qquad (x, A) \mapsto \begin{cases} 1 & \text{if } (x, x) \in A, \\ 0 & \text{if } (x, x) \notin A, \end{cases} \]

and

\[ T : (X, \Sigma) \to (*, \{\emptyset, *\}), \]

defined by

\[ T : X \times \{\emptyset, *\} \to [0, 1]; \qquad (x, *) \mapsto 1, \quad (x, \emptyset) \mapsto 0, \]

form the comultiplication and counit for a comonoid structure respectively.

Indeed, observe that both these maps are deterministic stochastic maps, with V
specified by the measurable function $X \to X \times X;\ x \mapsto (x, x)$ and T specified by
$X \to *;\ x \mapsto *$. From here it is straightforward to verify that these functions obey
coassociativity and counitality as functions in Meas, and hence these identities are
true in Stoch.
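
For a finite set the existence half of this argument can be checked mechanically. The sketch below, in Python, assumes stochastic maps between finite sets are written as column-stochastic matrices (rows index outputs, columns index inputs); the helper names are our own and are not notation from the text. It checks coassociativity, counitality and commutativity for the copy and discard maps on a three-element set. It says nothing about uniqueness.

```python
# Sketch: the copy/discard comonoid on a finite set, as column-stochastic matrices.
# Helper names are illustrative; rows index outputs, columns index inputs.

def matmul(g, f):
    """Composite "f then g" of two stochastic matrices."""
    return [[sum(g[i][k] * f[k][j] for k in range(len(f)))
             for j in range(len(f[0]))] for i in range(len(g))]

def kron(f, g):
    """Monoidal (Kronecker) product; the pair (i, k) sits at index i * len(g) + k."""
    return [[f[i][j] * g[k][l] for j in range(len(f[0])) for l in range(len(g[0]))]
            for i in range(len(f)) for k in range(len(g))]

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def copy_map(n):
    """Comultiplication x |-> (x, x): an n*n by n matrix of zeros and ones."""
    return [[int(i == x and j == x) for x in range(n)]
            for i in range(n) for j in range(n)]

def discard(n):
    """Counit: the unique stochastic map to the one-point space."""
    return [[1] * n]

def swap(n):
    """Symmetry (x, y) |-> (y, x) on the product of the set with itself."""
    return [[int(a == d and b == c) for c in range(n) for d in range(n)]
            for a in range(n) for b in range(n)]

n = 3
V, T, one = copy_map(n), discard(n), identity(n)

coassociative = matmul(kron(V, one), V) == matmul(kron(one, V), V)
counital = matmul(kron(T, one), V) == one and matmul(kron(one, T), V) == one
commutative = matmul(swap(n), V) == V
```

All three flags evaluate to `True`, since every matrix involved encodes a function and the laws hold for the underlying functions.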

We next prove uniqueness. As the monoidal unit $(*, \{\emptyset, *\})$ is terminal, there is
a unique stochastic map from $(X, \Sigma)$ to the terminal object, and so T is the only
possible choice for the counit of a comonoid on $(X, \Sigma)$. Suppose that
$\delta : (X, \Sigma) \to (X \times X, \Sigma \otimes \Sigma)$ is a stochastic map such that $(\delta, T)$ forms a comonoid. We will show
in fact that for all $x \in X$ and $A \in \Sigma \otimes \Sigma$ we have

\[ \delta(x, A) = \begin{cases} 1 & \text{if } (x, x) \in A, \\ 0 & \text{if } (x, x) \notin A, \end{cases} \]

and so $\delta = V$. Note that as $\delta_x = \delta(x, -)$ is a probability measure on $(X \times X, \Sigma \otimes \Sigma)$
it is enough to show that $\delta(x, A) = 1$ whenever $(x, x) \in A$.

To begin, note that counitality on the right implies that

\[ (1_X \otimes T) \circ \delta = \mathrm{id}_{(X, \Sigma)}, \]

so for all $x \in X$ and $B \in \Sigma$ we have

\[ \delta_x(B \times X) = \delta(x, B \times X) = \int_{X \times X} \chi_{B \times X} \, d\delta_x = \int_{X \times X} (1_X \otimes T)(-, -, B) \, d\delta_x = \mathrm{id}_{(X, \Sigma)}(x, B) = \begin{cases} 1 & \text{if } x \in B, \\ 0 & \text{if } x \notin B. \end{cases} \]

Similarly, counitality on the left implies that for all $x \in X$ and $B \in \Sigma$ we have

\[ \delta_x(X \times B) = \begin{cases} 1 & \text{if } x \in B, \\ 0 & \text{if } x \notin B. \end{cases} \]

We shall use these facts in the following.

Fix $x \in X$ and let now $A \in \Sigma \otimes \Sigma$ be such that $(x, x) \in A$. Recalling our
characterisation of product $\sigma$-algebras in Example 2.4, we may assume A is of the
form $A = \bigcup_{i \in I} (C_i \times D_i)$, where I is a countable set and $C_i, D_i \in \Sigma$ for all $i \in I$. There
thus exists $0 \in I$ such that $(x, x) \in C_0 \times D_0$, and hence $x \in C_0$ and $x \in D_0$. Since we
have shown above that this implies that $\delta_x(C_0^c \times X) = \delta_x(X \times D_0^c) = 0$, we then have

\begin{align*}
\delta_x(A) &= \delta_x\Big(\bigcup_{i \in I} (C_i \times D_i)\Big) \\
&\geq \delta_x(C_0 \times D_0) \\
&= \delta_x\big((C_0 \times X) \cap (X \times D_0)\big) \\
&= 1 - \delta_x\big((C_0^c \times X) \cup (X \times D_0^c)\big) \\
&\geq 1 - \big(\delta_x(C_0^c \times X) + \delta_x(X \times D_0^c)\big) \\
&= 1,
\end{align*}

so $\delta(x, A) = 1$ as required.

It remains to check that this comonoid is commutative. Recalling that the swap
$\sigma : (X \times X, \Sigma \otimes \Sigma) \to (X \times X, \Sigma \otimes \Sigma)$ on $(X, \Sigma) \otimes (X, \Sigma)$ is the deterministic
stochastic map given by $X \times X \to X \times X;\ (x, y) \mapsto (y, x)$, it is immediately clear
that the comultiplication V is commutative. □

Arguing as for models of Set, given a causal theory C, strong monoidal functors
$P : C \to \mathbf{Stoch}$ are thus specified by arbitrary choices of measurable space for each
atomic variable, and subsequent arbitrary choices of causal mechanisms of the
required domain and codomain.

These interpretations of the causal mechanisms give rise to a joint probability
measure compatible with the causal structure underlying the causal theory. Indeed,
this can be seen as the key difference between a model of a causal theory in Stoch and 
a Bayesian network: models of causal theories privilege factorisations, while Bayesian 
networks only care about the joint probability measure. 

Theorem 4.5. Let G be a directed acyclic graph with vertex set V and let 
$P : C_G \to \mathbf{Stoch}$ be a model of the causal theory $C_G$ in Stoch. Then the causal
structure G and the probability measure defined by the prior P[V] are compatible.

Proof. Recall that [V] is the prior on the collection consisting of one copy of each
of the atomic variables V. For each $v \in V$ we have a measurable space Pv, and
as $P[V] : * \to PV$ is a point of Stoch it defines a joint probability measure on the
product measurable space PV. We must show that P[V] has the required factorisation.

To this end, choose some ancestral ordering of V, writing V now as $\{v_1, \dots, v_n\}$
with the elements numbered according to this ordering. By construction, the string
diagram of the prior [V] consists of one copy of each causal mechanism $[v_i|\mathrm{pa}(v_i)]$
and $k_i$ copies of the comultiplication on each $v_i$, where $k_i$ is the number of children of the
vertex $v_i$. As each $v_i$ appears exactly once as the codomain of the causal mechanisms
$[v_i|\mathrm{pa}(v_i)]$, the coassociativity of each comonoid and the rules of the graphical calculus
for symmetric monoidal categories show that any way of connecting these elements
to form a morphism $\emptyset \to V$ produces the same morphism. In particular, we may
build [V] as the composite of the morphisms $[V]_i : v_1 \dots v_{i-1} \to v_1 \dots v_i$ defined by

[String diagram of $[V]_i$, with inputs $v_1 v_2 \dots v_{i-1}$ and outputs $v_1 v_2 \dots v_i$: each $v_j \notin \mathrm{pa}(v_i)$ passes straight through, while each $v_j \in \mathrm{pa}(v_i)$ is copied, with one copy passed through and the other fed into $[v_i|\mathrm{pa}(v_i)]$.]

In words, the morphism $[V]_i$ is the morphism $v_1 \dots v_{i-1} \to v_1 \dots v_i$ constructed by
applying a comultiplication to each of the parents of $v_i$, and then applying the causal
mechanism $[v_i|\mathrm{pa}(v_i)]$. Note that as we have ordered the set V with an ancestral
ordering, all parents of $v_i$ do lie in the set of predecessors of $v_i$.

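Observe now that given any stochastic map $k : (X, \Sigma_X) \to (Y, \Sigma_Y)$, if V is the
unique comultiplication on X, then the composite $(1_X \otimes k) \circ V : X \to X \times Y$ is given by

\[ \big((1_X \otimes k) \circ V\big)(x, A \times B) = \int_{X \times X} (1_X \otimes k)(-, -, A \times B) \, dV_x = \chi_A(x) \, k(x, B) \]

for all $x \in X$, $A \in \Sigma_X$, $B \in \Sigma_Y$. Furthermore, if $\mu : * \to (X, \Sigma_X)$ is a measure on X,
then the composite $(1_X \otimes k) \circ V \circ \mu$ is given by

\[ \big((1_X \otimes k) \circ V \circ \mu\big)(A \times B) = \int_A k(-, B) \, d\mu \]

for all $A \in \Sigma_X$, $B \in \Sigma_Y$.

Thus, taking the image under P, each of the maps $[V]_i$ gives

\[ P[V]_i(x_1, \dots, x_{i-1}, A_1 \times \cdots \times A_i) = \prod_{j=1}^{i-1} \chi_{A_j}(x_j) \; P[v_i|\mathrm{pa}(v_i)](x_{\mathrm{pa}(v_i)}, A_i) \]

for all $x_j \in Pv_j$, $A_j \in \Sigma_{Pv_j}$, where $x_{\mathrm{pa}(v_i)}$ denotes the tuple of those $x_j$ with $v_j \in \mathrm{pa}(v_i)$, and composing them gives

\begin{align*}
P[V](A_1 \times \cdots \times A_n) &= P[V]_n \circ \cdots \circ P[V]_2 \circ P[V]_1 (A_1 \times A_2 \times \cdots \times A_n) \\
&= \int_{A_1} P[V]_n \circ \cdots \circ P[V]_2(-, A_2 \times \cdots \times A_n) \, dP[v_1|\mathrm{pa}(v_1)] \\
&= \cdots \\
&= \int_{A_1} \int_{A_2} \cdots \int_{A_n} dP[v_n|\mathrm{pa}(v_n)] \cdots dP[v_2|\mathrm{pa}(v_2)] \, dP[v_1|\mathrm{pa}(v_1)]
\end{align*}

for all $A_j \in \Sigma_{Pv_j}$. This is a factorisation of P[V] of the required type. □

We call the pair $(G, P[V])$ the Bayesian network induced by P. Thus we see that,
given a causal structure and any stochastic causal model of its theory, the induced
joint distribution on the atomic variables of the theory forms a Bayesian network with
the causal structure. On the other hand, if we have a Bayesian network on this causal
structure, we may construct a stochastic causal model inducing this distribution, but
only by picking some factorisation of our joint distribution. To reiterate, this is the
key distinction between Bayesian networks and stochastic causal models: a Bayesian
network on a causal structure requires only that there exist a factorisation for the
distribution respecting the causal structure, while a stochastic causal model explicitly
chooses a factorisation.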
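
In the finite case the passage from an explicit choice of mechanisms to the induced joint distribution is a short computation. The sketch below, in Python, assumes a hypothetical three-variable chain A → B → C with binary outcomes and made-up probabilities; the variable names, numbers and the helper `joint` are illustrative assumptions, not data from this dissertation.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical DAG A -> B -> C, listed in an ancestral order (parents first).
order = ["A", "B", "C"]
parents = {"A": (), "B": ("A",), "C": ("B",)}
outcomes = {v: (0, 1) for v in order}

# One causal mechanism per vertex: for each tuple of parent outcomes,
# a probability distribution on the outcomes of the child.
mechanism = {
    "A": {(): {0: F(1, 4), 1: F(3, 4)}},
    "B": {(0,): {0: F(1, 2), 1: F(1, 2)}, (1,): {0: F(1, 3), 1: F(2, 3)}},
    "C": {(0,): {0: F(1, 5), 1: F(4, 5)}, (1,): {0: F(1), 1: F(0)}},
}

def joint(order, parents, outcomes, mechanism):
    """Joint distribution obtained by multiplying the mechanisms along the ordering."""
    table = {}
    for values in product(*(outcomes[v] for v in order)):
        assignment = dict(zip(order, values))
        weight = F(1)
        for v in order:
            pa = tuple(assignment[u] for u in parents[v])
            weight *= mechanism[v][pa][assignment[v]]
        table[values] = weight
    return table

prior_V = joint(order, parents, outcomes, mechanism)
total = sum(prior_V.values())
```

The dictionary `mechanism` plays the role of the chosen factorisation, while `prior_V` is the only thing a Bayesian network on the same graph would record.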

An advantage of working within a causal theory, rather than just with the induced 
Bayesian network, is that the additional structure allows neat representations of op- 
erations that one might want to do to a Bayesian network. The remainder of this 
dissertation comprises a brief exploration of this. We conclude this section by noting 
that the priors of the causal theory represent the marginals of the induced Bayesian 
network. 

Theorem 4.6. Given a model $P : C_G \to \mathbf{Stoch}$ of a causal theory $C_G$ and a set
w of atomic variables of $C_G$, the prior P[w] is equal to the marginal on the product
measurable space of the variables in w of the induced Bayesian network.

Proof. We first note a more general fact: given a joint probability measure on
$X \times Y$ expressed as a point in Stoch, marginalisation over Y can be expressed as the
composite of this point with the product of the identity 1 on X and counit T on Y.
Indeed, if $\mu : * \to X \times Y$ is a probability measure, then

\[ \big((1 \otimes T) \circ \mu\big)(A) = \int_{X \times Y} (1 \otimes T)(x, y, A) \, d\mu = \int_{X \times Y} \chi_A(x) \, d\mu = \mu_X(A). \]

Thus the marginals of P[V] may be expressed by composing P[V] with counits on
the factors marginalised over. We wish to show that these are the priors P[w] of C.
Reasoning inductively, to show this it is enough to show that the composite of a prior
with the product of a counit on one of its factors and identity maps on the remaining
factors is again a prior.

Let w be a set of atomic variables of C and let $v \in w$. We will show that the
composite of P[w] with the product of the counit on v and identity on $w \setminus \{v\}$ is
equal to the prior $P[w \setminus \{v\}]$. We split into two cases: when v has a consequence in
$G_{\emptyset \to w}$, and when v has no consequences in $G_{\emptyset \to w}$. For the first case, observe that
$G_{\emptyset \to w} = G_{\emptyset \to w \setminus \{v\}}$ and $k_v \geq 1$. Thus the priors [w] and $[w \setminus \{v\}]$ are the same but
for the fact we compose with one extra comultiplication after the causal mechanism
$[v|\mathrm{pa}(v)]$ in the case of the prior [w]. Thus the composite of [w] with a counit on v is
equal to $[w \setminus \{v\}]$ by the counitality law on v.

To deal with the second case we must work in Stoch and make use of the fact
that the monoidal unit in Stoch is terminal. Indeed, as the monoidal unit in Stoch
is terminal, in Stoch we have the equality of morphisms

[String diagram: the causal mechanism $[v|\mathrm{pa}(v)]$ followed by the counit T on v is equal to the counit on $\mathrm{pa}(v)$.]

As v has no consequences in $G_{\emptyset \to w}$, the causal mechanism $[v|\mathrm{pa}(v)]$ is not followed
by any comultiplications in the construction of [w]. Thus, after composing P[w]
with a counit on v we may invoke the above identity, and then invoke the counitality
law for each of the parents of v. This means that the composite is equal to the morphism
constructed without the causal mechanism $[v|\mathrm{pa}(v)]$, and with one fewer comultiplication
on each of the parents of v. But this is precisely the morphism $P[w \setminus \{v\}]$.
This proves the theorem. □
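
Both cases of this proof can be seen in a small finite example. The following Python sketch again assumes a hypothetical chain A → B → C with made-up binary mechanisms (none of these numbers come from the text). Discarding C, which has no consequences, simply removes its mechanism; discarding A, which has the consequence B, leaves the prior on B.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical chain A -> B -> C; p_b_given_a[a][b] is the probability of b given a.
p_a = {0: F(1, 4), 1: F(3, 4)}
p_b_given_a = {0: {0: F(1, 2), 1: F(1, 2)}, 1: {0: F(1, 3), 1: F(2, 3)}}
p_c_given_b = {0: {0: F(1, 5), 1: F(4, 5)}, 1: {0: F(1), 1: F(0)}}

pairs = list(product((0, 1), repeat=2))

# Joint prior on (A, B, C), built from all three mechanisms.
joint = {(a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
         for (a, b) in pairs for c in (0, 1)}

# Second case: C has no consequences. Discarding C (summing it out) ...
marginal_ab = {(a, b): sum(joint[(a, b, c)] for c in (0, 1)) for (a, b) in pairs}
# ... agrees with the prior on (A, B), which never mentions the mechanism for C.
prior_ab = {(a, b): p_a[a] * p_b_given_a[a][b] for (a, b) in pairs}

# First case: A has the consequence B. Discarding A from the prior on (A, B) ...
marginal_b = {b: sum(prior_ab[(a, b)] for a in (0, 1)) for b in (0, 1)}
# ... agrees with the prior on B, the mechanism for B applied to the prior on A.
prior_b = {b: sum(p_b_given_a[a][b] * p_a[a] for a in (0, 1)) for b in (0, 1)}
```

The first equality holds because each distribution `p_c_given_b[b]` sums to one, which is the finite shadow of the monoidal unit being terminal.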

4.3 Application: visualising Simpson's paradox 

One of the strengths of causal theories is that their graphical calculi provide a guide 
to which computations should be made if one wants to respect a causal structure, and 
in doing so also clarify what these computations mean. An illustration can be found 
in an exploration of confounding variables and Simpson's paradox. This section owes
much to Pearl [19, Chapter 6], extending the basics of that discussion with our new
graphical notation.

Simpson's paradox refers to the perhaps counterintuitive fact that it is possible
to have data such that, for all outcomes of a confounding variable, a fixed outcome
of the independent variable makes another fixed outcome of the dependent variable 
more likely, and yet also that upon aggregation over the confounding variable the 
same fixed outcome of the independent variable makes the same fixed outcome of the 
dependent variable less likely. This is perhaps best understood through an example. 

Consider the following, somewhat simplified, scenario: let us imagine that we 
wish to test the efficacy of a proposed new treatment for a certain heart condition. 
In our clinical experiment, we take two groups of patients each suffering from the 
heart condition and, after treating the patients in the proposed way, record whether
they recover. In addition, as we know that having a healthy blood pressure is also an
important factor in recovery, we also take records of whether the blood pressure of
the patient is within healthy bounds or otherwise at the conclusion of the treatment
programme. This gives three binary variables: an independent variable $T = \{t, \neg t\}$,
a dependent variable $R = \{r, \neg r\}$, and a third, possibly confounding variable
$B = \{b, \neg b\}$, where we think of the variables as representing the truth or otherwise of the
following propositions:

T: the patient receives treatment for heart condition. 

R: the patient has recovered at the conclusion of treatment. 

B: the patient has healthy blood pressure at post-treatment checkup. 

Suppose then that our experiment yields the data of Figure 4.1.

            T                               T, B
          t    ¬t                   t,b   ¬t,b   t,¬b   ¬t,¬b
 R   r   39    42          R   r     30     40      9       2
    ¬r   61    58             ¬r     10     40     51      18


Figure 4.1: The table on the left displays the experiment data for the treatment and 
recovery of patients, while the table on the right displays the experiment when further 
subdivided according to the blood pressure of the patients measured post-treatment. 

In these data we see the so-called paradox: for both patients with healthy and
unhealthy blood pressure, treatment seems to significantly improve the chance of
recovery, with the recovery rates increasing from 50% to 75% and from 10% to 15%
respectively when patients are treated. On the other hand, when the studies are taken
as a whole, it seems treatment has no significant effect on the recovery rate, which drops
slightly from 42% to 39%. Given this result, it is not clear whether the experiment
indicates that treatment improves or even impairs chance of recovery. Should we or 
should we not then recommend the treatment? 
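
The percentages quoted above can be recomputed directly from the counts in Figure 4.1. The following Python sketch does so with exact fractions; the dictionary layout and the helper name `recovery_rate` are our own choices for illustration.

```python
from fractions import Fraction as F

# Counts from Figure 4.1, keyed by (treated, healthy_bp, recovered).
counts = {
    (True, True, True): 30,   (True, True, False): 10,
    (False, True, True): 40,  (False, True, False): 40,
    (True, False, True): 9,   (True, False, False): 51,
    (False, False, True): 2,  (False, False, False): 18,
}

def recovery_rate(treated, healthy_bp=None):
    """Fraction recovered among patients matching the given conditions.

    Passing healthy_bp=None aggregates over blood pressure.
    """
    rows = [(key, n) for key, n in counts.items()
            if key[0] == treated and (healthy_bp is None or key[1] == healthy_bp)]
    recovered = sum(n for key, n in rows if key[2])
    return F(recovered, sum(n for _, n in rows))

# Within each blood pressure group treatment helps ...
healthy = (recovery_rate(False, True), recovery_rate(True, True))
unhealthy = (recovery_rate(False, False), recovery_rate(True, False))
# ... yet in aggregate the treated group recovers slightly less often.
aggregate = (recovery_rate(False), recovery_rate(True))
```

Each pair lists the untreated rate first and the treated rate second, so the reversal appears as two increasing pairs followed by a decreasing one.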

The answer depends on the causal relationships between our variables. Suppose 
that the treatment acts in part via affecting blood pressure. Then the causal structure 
of the variables is given by the graph

[Graph: T → B, T → R, B → R]

In this case we should make our decision with respect to the aggregated data: otherwise,
when we condition on the post-treatment blood pressures we eliminate information 
about how the treatment is affecting blood pressure, and so eliminate information 
about an important causal pathway between treatment and recovery. We therefore 
should not recommend treatment — although when we control for blood pressure the 
treatment seems to improve chances of recovery, the treatment also makes it less likely 
that a healthy blood pressure will be reached, offsetting any gain. 

On the other hand, suppose that the treatment works in a way that has no effect 
on blood pressure. Then from the fact that the blood pressure and treatment variables 
are not independent we may deduce that the blood pressure variable biased selection 
for the treatment trial, and so the causal structure representing these variables is

[Graph on T, B and R: both T and B have arrows into R, and there is no arrow from T to B.]

Here we should pay attention to the data when divided according to blood pressure, 
as by doing this we control for the consequences of this variable. We then see that no 
matter whether a patient has factors leading to healthy or unhealthy blood pressure,
the treatment raises their chance of recovery by a significant proportion. 

These ideas are codified in the corresponding causal theories and their maps, with
the causal effect of treatment on recovery expressed via the causal conditional $[R\|T]$.
For the first structure, let the corresponding causal theory be $C_1$, and the data give
the following interpretations of the causal mechanisms $P : C_1 \to \mathbf{FinStoch}$:

\[ P[T] = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}, \qquad P[B|T] = \begin{pmatrix} 0.4 & 0.8 \\ 0.6 & 0.2 \end{pmatrix}, \qquad P[R|TB] = \begin{pmatrix} 0.75 & 0.15 & 0.5 & 0.1 \\ 0.25 & 0.85 & 0.5 & 0.9 \end{pmatrix}. \]

Here we have written the maps as their representations in SMat with respect to the
basis ordering given in our definition of the variables. The causal conditional $P[R\|T]$
is then

\[ P[R\|T] = \begin{pmatrix} 0.39 & 0.42 \\ 0.61 & 0.58 \end{pmatrix}. \]


The elements of the first row of this matrix represent the probability of recovery 
given treatment and no treatment respectively, and so this agrees with our assertion 
that in this case we should view the treatment as ineffective, and perhaps marginally 
harmful. 
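
This matrix can be recomputed from the mechanisms. The Python sketch below assumes the first causal structure, in which T influences R both directly and through B. The entries of [B|T] are read off Figure 4.1 (40 of the 100 treated and 80 of the 100 untreated patients have healthy blood pressure), and the function name is our own.

```python
from fractions import Fraction as F

# Mechanisms of the first model as column-stochastic matrices (rows index outputs).
# Column orderings: (t, not-t) for [B|T];
# (t b, t not-b, not-t b, not-t not-b) for [R|TB].
B_given_T = [[F("0.4"), F("0.8")],
             [F("0.6"), F("0.2")]]
R_given_TB = [[F("0.75"), F("0.15"), F("0.5"), F("0.1")],
              [F("0.25"), F("0.85"), F("0.5"), F("0.9")]]

def conditional_via_mediator(R_given_TB, B_given_T):
    """Copy T, send one copy through [B|T], then apply [R|TB] to the pair (T, B)."""
    out = [[F(0), F(0)], [F(0), F(0)]]
    for t in range(2):
        for b in range(2):
            for r in range(2):
                # Column index 2 * t + b encodes the joint outcome (t, b).
                out[r][t] += R_given_TB[r][2 * t + b] * B_given_T[b][t]
    return out

R_cond_T = conditional_via_mediator(R_given_TB, B_given_T)
```

The first row of `R_cond_T` holds the recovery probabilities with and without treatment, matching the aggregated rates of 39% and 42%.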

On the other hand, writing the causal theory corresponding to the second causal
structure as $C_2$, in this case the data give the stochastic model $Q : C_2 \to \mathbf{FinStoch}$
defined by the maps

\[ Q[T] = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}, \qquad Q[B] = \begin{pmatrix} 0.6 \\ 0.4 \end{pmatrix}, \qquad Q[R|TB] = \begin{pmatrix} 0.75 & 0.15 & 0.5 & 0.1 \\ 0.25 & 0.85 & 0.5 & 0.9 \end{pmatrix}. \]

We then may compute the causal conditional $Q[R\|T]$ to be

\[ Q[R\|T] = \begin{pmatrix} 0.51 & 0.34 \\ 0.49 & 0.66 \end{pmatrix}. \]

This again agrees with the above assertion that with this causal structure the treatment
is effective, as here the probability of recovery with treatment is 0.51, compared
with a probability of recovery without treatment of 0.34. In this case the map $[R\|T]$
is the only inferential map from T to R; it thus may be seen as the only way to deduce
information about R from information about T consistent with their causal relationship.
Thus within the framework given by the causal theory, there is no possible way
to come to the wrong conclusion about the efficacy of the treatment.
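
Under the second structure the same number can be obtained as a composite of matrices, mirroring the string diagram: the prior on B is placed alongside the identity on T and the result is fed into [R|TB]. The Python sketch below assumes column-stochastic matrices and uses the prior (0.6, 0.4) on B read off Figure 4.1 (120 of the 200 patients have healthy blood pressure); the helper names are our own.

```python
from fractions import Fraction as F

def matmul(g, f):
    """Composite "f then g" of two column-stochastic matrices."""
    return [[sum(g[i][k] * f[k][j] for k in range(len(f)))
             for j in range(len(f[0]))] for i in range(len(g))]

def kron(f, g):
    """Monoidal (Kronecker) product; the pair (i, k) sits at index i * len(g) + k."""
    return [[f[i][j] * g[k][l] for j in range(len(f[0])) for l in range(len(g[0]))]
            for i in range(len(f)) for k in range(len(g))]

identity_T = [[F(1), F(0)],
              [F(0), F(1)]]
B_prior = [[F("0.6")],
           [F("0.4")]]
R_given_TB = [[F("0.75"), F("0.15"), F("0.5"), F("0.1")],
              [F("0.25"), F("0.85"), F("0.5"), F("0.9")]]

# (identity on T) tensor (prior on B) is a 4-by-2 matrix T -> T x B,
# whose rows follow the ordering (t b, t not-b, not-t b, not-t not-b).
pair_with_B = kron(identity_T, B_prior)

# The causal conditional: recovery given treatment, averaging over the prior on B.
R_cond_T = matmul(R_given_TB, pair_with_B)
```

Because blood pressure enters only through its prior, the treatment cannot alter it here, which is why the stratified improvement survives in the result.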

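In the first case, however, there is one other map; we may infer information about
recovery given treatment via

[String diagram: $[R|TB]$ applied to T together with a separately prepared copy of B given by $[B|T] \circ [T]$.]

\[ P[R|TB] \circ \big(1_T \otimes (P[B|T] \circ P[T])\big) = \begin{pmatrix} 0.51 & 0.34 \\ 0.49 & 0.66 \end{pmatrix}. \]
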
As suggested by the form of the string diagram, this may be interpreted as the chance 
of recovery if the effect of the treatment on blood pressure is nullified, but nonetheless 
assuming that the proportion of patients presenting healthy blood pressures at the 
conclusion of the treatment was typical. In particular, this indicates that if it was 
inevitable that a group of patients would end up with healthy blood pressure levels 
in the proportion specified by [B\T] o [T], then the treatment would be effective for 
this group. 

Note that the string diagrams themselves encode the flow of causal influence in
their depictions of the conditionals. In doing so they make the source of confusion
patently clear: we may judge the effect of treatment on recovery in two different ways,
one in which we use information about how treatment affected blood pressure, and one
in which we forget this link and assume the variables are unrelated.

Finally, observe that causal structures are thus very relevant when interpreting 
data, and awareness of them can allow one to extract information that could not 
otherwise be extracted. Indeed, although under the second causal structure the fact 
that the blood pressure variable biased our selection procedure for treatment — making 
it more likely that we treated those with unhealthy blood pressure — can be seen as 
ill-considered experiment design, we see that nonetheless an understanding of the 
causal structure allowed us to recover the correct conclusion from the data. This 
becomes critically useful in cases when we do not have the ability to correct such
biases methodologically, such as when data is taken from observational studies rather
than controlled experiments.

Chapter 5

The Structure of Stochastic Causal Models

Our aim through this dissertation has been to develop tools to discuss causality, and
in particular causal relationships between random variables. Our claim is now that these
are well described by stochastic causal models: models $P : C \to \mathbf{CGStoch}$ of a
causal theory C in CGStoch. Indeed, we have seen these are slight generalisations of
Bayesian networks in which the factorisation of the joint distribution is made explicit.
One advantage of moving to this setting is that we now have a natural notion of map 
between causal models: a monoidal natural transformation between their functors. 
We begin this chapter by exploring these, before using the knowledge we gain to look 
at the existence or otherwise of some basic universal constructions in the category of 
stochastic causal models. 

5.1 Morphisms of stochastic causal models 

Fix a causal theory C. Although we have so far had no problems discussing models
of causal theories in Stoch, we shall define the stochastic causal models of C to be
the objects of the category $\mathbf{CGStoch}^{C}_{\mathrm{SSM}}$ of strong symmetric monoidal functors
$C \to \mathbf{CGStoch}$. This more restrictive definition allows for a better-behaved
notion of maps between stochastic causal models. Indeed, we take the notion of
morphism in $\mathbf{CGStoch}^{C}_{\mathrm{SSM}}$, a monoidal natural transformation between functors,
to be the notion of map between stochastic causal models. As we will see in this
section, these are much like deterministic stochastic maps. Our aim will be to define
the terms in, and then prove, the following theorem:

Theorem 5.1. Morphisms of stochastic causal models factor into a coarse graining
followed by an embedding.

To this end, let $P, Q : C \to \mathbf{CGStoch}$ be stochastic causal models, and let
$\alpha : P \Rightarrow Q$ be a monoidal natural transformation. By definition, this means we have
a collection of stochastic maps $\alpha_w : Pw \to Qw$ such that for all variables $w, w' \in C$
and all morphisms $r : w \to w'$ the following diagrams commute:

[Commutative diagrams: the naturality square, $Qr \circ \alpha_w = \alpha_{w'} \circ Pr$; the triangle formed by $\alpha_\emptyset : P\emptyset \to Q\emptyset$ and the isomorphisms of $P\emptyset$ and $Q\emptyset$ with the monoidal unit $*$; and the lower square relating $\alpha_w \otimes \alpha_{w'} : Pw \otimes Pw' \to Qw \otimes Qw'$ to $\alpha_{ww'} : P(ww') \to Q(ww')$ via the coherence isomorphisms of P and Q.]

We can, however, write this definition a bit more efficiently.

As $P\emptyset$ and $Q\emptyset$ are isomorphic to the monoidal unit, and as the monoidal unit $*$
of CGStoch is terminal, the above triangle gives no constraints on the morphisms $\alpha_w$.
The lower square specifies the relationships between the maps $\alpha_w$ and $\alpha_{w'}$ and the
map $\alpha_{ww'}$ on the product variable. Due to this, it suffices to define the natural
transformation $\alpha$ only on the atomic variables v of C, and let the commutativity of
the square specify the maps on the remaining variables. It thus remains to ensure
that our maps on the atomic variables v satisfy the defining square of a natural
transformation.

We first consider the constraints given by the comonoid maps. The counit maps
provide no constraint: since the monoidal unit is terminal, the diagram

\[ T \circ \alpha_v = T \]

always commutes. On the other hand, the comultiplication maps heavily constrain
the $\alpha_v$: they require that

\[ V \circ \alpha_v = (\alpha_v \otimes \alpha_v) \circ V, \]

where the comultiplication on the left is that of Qv and the one on the right is that of Pv.
[String diagram form of the same equation.]

The following lemma shows that this is true if and only if each $\alpha_v$ is deterministic.

Lemma 5.2. A stochastic map is a comonoid homomorphism if and only if it is 
deterministic. 

Proof. Let $k : X \to Y$ be a stochastic map. As the monoidal unit is terminal in
CGStoch, all stochastic maps preserve the counit. We thus want to show that

\[ (k \otimes k) \circ V = V \circ k \]

if and only if k is deterministic.

Now, given $x \in X$ and $B \in \Sigma_Y$, the left hand side of the above equality takes
value

\begin{align*}
\big((k \otimes k) \circ V\big)(x, B \times B) &= \int (k \otimes k)(-, -, B \times B) \, dV_x \\
&= \int k(-, B) \, k(-, B) \, dV_x \\
&= k(x, B)^2,
\end{align*}

while the right hand side equals

\begin{align*}
(V \circ k)(x, B \times B) &= \int V(-, B \times B) \, dk_x \\
&= \int \chi_B \, dk_x \\
&= k(x, B).
\end{align*}

Thus if k is a comonoid homomorphism, then $k(x, B)^2 = k(x, B)$, and hence $k(x, B) = 0$
or 1. This shows that k is deterministic. Conversely, if k is deterministic, then
$k(x, B)^2 = k(x, B)$ for all $x \in X$, $B \in \Sigma_Y$, so k is a comonoid homomorphism. □
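
For finite sets the lemma can be tested by brute force. The Python sketch below assumes stochastic maps are given as column-stochastic matrices and that the comultiplication is the copy map; the helper names and the two example matrices are hypothetical illustrations.

```python
from fractions import Fraction as F

def matmul(g, f):
    """Composite "f then g" of two column-stochastic matrices."""
    return [[sum(g[i][k] * f[k][j] for k in range(len(f)))
             for j in range(len(f[0]))] for i in range(len(g))]

def kron(f, g):
    """Monoidal (Kronecker) product; the pair (i, k) sits at index i * len(g) + k."""
    return [[f[i][j] * g[k][l] for j in range(len(f[0])) for l in range(len(g[0]))]
            for i in range(len(f)) for k in range(len(g))]

def copy_map(n):
    """Comultiplication x |-> (x, x) on an n-element set."""
    return [[int(i == x and j == x) for x in range(n)]
            for i in range(n) for j in range(n)]

def preserves_copy(k):
    """Check (k tensor k) after copy == copy after k."""
    n_out, n_in = len(k), len(k[0])
    return matmul(kron(k, k), copy_map(n_in)) == matmul(copy_map(n_out), k)

# Induced by the function 0 -> 0, 1 -> 1, 2 -> 0: every column is a point mass.
deterministic = [[1, 0, 1],
                 [0, 1, 0]]
# The first column is a fair coin, so the first input is sent to a genuine mixture.
noisy = [[F(1, 2), 0, 1],
         [F(1, 2), 1, 0]]

deterministic_ok = preserves_copy(deterministic)
noisy_ok = preserves_copy(noisy)
```

The check fails for `noisy` exactly because copying after the coin flip produces perfectly correlated outcomes, while flipping two coins after copying does not.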

Summing up, a morphism $\alpha : P \Rightarrow Q$ of stochastic causal models is specified by a
collection $\{\alpha_v\}_{v \in V_C}$ of deterministic stochastic maps $\alpha_v : Pv \to Qv$ such that for all
atomic variables $v \in V_C$ the squares

\[ Q[v|\mathrm{pa}(v)] \circ \alpha_{\mathrm{pa}(v)} = \alpha_v \circ P[v|\mathrm{pa}(v)], \]

with $\alpha_{\mathrm{pa}(v)} : P(\mathrm{pa}(v)) \to Q(\mathrm{pa}(v))$, commute.



We say that a morphism $\alpha$ of stochastic models of C is an embedding if for all
objects v of C the deterministic stochastic map $\alpha_v$ is an embedding. Similarly, we say
that a morphism $\alpha$ of stochastic models of C is a coarse graining if for all objects v of
C the deterministic stochastic map $\alpha_v$ is a coarse graining. Theorem 5.1 now follows
from Proposition 2.20, with the causal model it factors through having the induced
structure.

We caution that despite the similar terminology to deterministic stochastic maps, 
the situation here differs as stochastic causal models consist of much more data than 
measurable spaces, and so the compatibility requirements a morphism must obey 
here are much stricter. For example, while in CGStoch it is always possible to find a 
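deterministic map between any two objects, this is rarely possible in $\mathbf{CGStoch}^{C}_{\mathrm{SSM}}$.

Let $P, Q : C \to \mathbf{CGStoch}$ be stochastic causal models, and let $\alpha : P \Rightarrow Q$ be a
morphism between them. Then for any prior [w] of C, the diagram

[Triangle with legs $P[w] : * \to Pw$, $Q[w] : * \to Qw$ and $\alpha_w : Pw \to Qw$; that is, $\alpha_w \circ P[w] = Q[w]$.]

commutes. This says that the pushforward measure of any prior P[w] along the
deterministic stochastic map $\alpha_w$ must agree with Q[w]. No such map exists, for
example, when Pw, Qw are binary discrete measurable spaces and P[w], Q[w] have
matrix representations

\[ P[w] = \begin{pmatrix} p \\ 1 - p \end{pmatrix}, \qquad Q[w] = \begin{pmatrix} q \\ 1 - q \end{pmatrix}, \]

with $p, q \in [0, 1]$ and $q \neq 0$, $p$, $1 - p$, or $1$.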
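
This obstruction is easy to see by enumeration, since there are only four functions between two-element sets. The Python sketch below lists every pushforward of the measure (p, 1 - p) along such a function; the function name and the sample value p = 1/3 are our own illustrative choices.

```python
from fractions import Fraction as F
from itertools import product

def reachable_first_components(p):
    """All q such that (q, 1 - q) is a pushforward of (p, 1 - p)
    along some function {0, 1} -> {0, 1}."""
    prior = (p, 1 - p)
    reachable = set()
    for f in product((0, 1), repeat=2):   # f[x] is the image of outcome x
        q = sum(prior[x] for x in (0, 1) if f[x] == 0)
        reachable.add(q)
    return reachable

reachable = reachable_first_components(F(1, 3))
```

For p = 1/3 the set contains only 0, 1/3, 2/3 and 1, so for instance no morphism can carry this prior to the uniform measure.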

As diagrams involving all morphisms of C, and not just the priors, are required
to commute, still more constraints apply. Although there are exceptions for finely-tuned
parameters, it is generically true that if one wishes to find a coarse graining
$\alpha : P \Rightarrow Q$ between two causal models, then two outcomes of a measurable space
Pw can be identified by the map $\alpha_w$ only when they define the same measure on the
codomain for all maps $P\varphi : Pw \to Pw'$, where $\varphi : w \to w'$ is any morphism of C with
domain w. The intuition here is that coarse grainings allow us to group outcomes
and treat them as a single outcome. But for this to be possible, the outcomes must
behave similarly enough to treat as one. Since outcomes now have consequences, we
have a much greater ability to see differences between them, and hence coarse grainings
are far more restrictive for causal models than for measurable spaces.
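
In the finite case this criterion amounts to comparing columns of a matrix. The Python sketch below groups the input outcomes of a column-stochastic matrix by equal columns. It is only a sketch of the generic necessary condition for a single outgoing map whose codomain is left unchanged, and the matrix and function name are hypothetical.

```python
from fractions import Fraction as F

def mergeable_classes(mechanism):
    """Group input outcomes (columns) that induce the same measure on the codomain."""
    classes = {}
    for x in range(len(mechanism[0])):
        column = tuple(row[x] for row in mechanism)
        classes.setdefault(column, []).append(x)
    return sorted(classes.values())

# Hypothetical map out of a three-outcome space into a two-outcome space.
# Outcomes 0 and 1 have identical consequences; outcome 2 behaves differently.
mechanism = [[F(1, 2), F(1, 2), F(1, 10)],
             [F(1, 2), F(1, 2), F(9, 10)]]

classes = mergeable_classes(mechanism)
```

Only outcomes in the same class are candidates for identification; a full check would repeat this for every morphism of the causal theory with the given domain.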

As embeddings do not identify distinguishable outcomes of the domain, we need
not worry about such complications in understanding restrictions on their construction.
Nonetheless, if $e : P \Rightarrow Q$ is an embedding of stochastic causal models, then
since the push-forward measure of any prior P[w] along the deterministic stochastic
map $e_w$ must agree with Q[w], any measurable set of Qw not intersecting the image of
the CGStoch-embedding $e_w$ must have Q[w]-measure zero. Furthermore, for the naturality
squares of the morphisms $\varphi : w \to w'$ of C to commute, each map $Q\varphi$ must behave as
$P\varphi$ on the image of $e_w$. This means that an embedding $e : P \Rightarrow Q$ forces the priors
of P and Q to be the 'same' up to sets of measure zero.

5.2 Basic constructions in categories of stochastic 
causal models 

In this final section we continue our characterisation of categories of stochastic causal 
models by exploring a few universal constructions. In particular, we show that these 
categories have a terminal object, but no initial object, and in general no products 
or coproducts either. Again fix a causal theory C. 

Proposition 5.3. The functor $T : C \to \mathbf{CGStoch}$ sending all objects of C to the
monoidal unit $*$ of CGStoch and all morphisms of C to the identity map on $*$ is a
terminal object in the category $\mathbf{CGStoch}^{C}_{\mathrm{SSM}}$ of stochastic causal models of C.

Proof. Note first that, since the monoidal product of $*$ with itself is again $*$, the
constant functor $T : C \to \mathbf{CGStoch}$ is a well-defined stochastic causal model.

Let $P : C \to \mathbf{CGStoch}$ be a stochastic causal model of C. We construct a
monoidal natural transformation $\alpha : P \Rightarrow T$. For each $w \in C$, define
$\alpha_w : Pw \to Tw = *$ to be the unique stochastic map $Pw \to *$. This exists as $*$ is terminal in
CGStoch. Furthermore, from the fact that $*$ is terminal in CGStoch it is immediate
that for each morphism of C the required naturality square commutes. As these maps
to the terminal object are each deterministic, we thus have a well-defined morphism
of stochastic causal models.

By construction it is clear that this is the unique morphism of causal models
$P \Rightarrow T$. This proves the proposition. □

The functor T is an example of what we will call a trivial model. Given a
measure space $(X, \Sigma, \mu)$, we define the trivial model on $(X, \Sigma, \mu)$ to be the functor
$T_\mu : C \to \mathbf{CGStoch}$ sending each atomic variable v of C to X, and each causal
mechanism $[v|\mathrm{pa}(v)]$ to the map $T_\mu(\mathrm{pa}(v)) \to * \to T_\mu v = X$ assigning to each
element of $T_\mu(\mathrm{pa}(v))$ the measure $\mu$. This represents the situation in which all the
atomic variables are the same random variable with the same prior, and have no
causal influence on each other. We shall use these to show the non-existence of an
initial object, products, and coproducts.

Proposition 5.4. The category $\mathbf{CGStoch}^{C}_{\mathrm{SSM}}$ of stochastic causal models of C has
no initial object.

Proof. We prove by contradiction. Suppose that $I : C \to \mathbf{CGStoch}$ is an initial
object of $\mathbf{CGStoch}^{C}_{\mathrm{SSM}}$.

Let $(B, \mathcal{P}(B), \nu)$ be the discrete measure space with two outcomes $\{b_1, b_2\}$ such
that the probability of each outcome is one half, and let $T_\nu$ be the trivial model of this
space. Note that as $\nu$ has full support, the only measure space $(X, \Sigma, \mu)$ for which
there exists a monic deterministic stochastic map $k : X \to B$ such that

[Triangle with legs $\mu : * \to X$, $\nu : * \to B$ and $k : X \to B$; that is, $k \circ \mu = \nu$.]

commutes is $(B, \mathcal{P}(B), \nu)$ itself. In this case k must also be an epimorphism in
CGStoch; k is either the identity map, or the map $s : B \to B$ induced by the
function sending $b_1$ to $b_2$ and $b_2$ to $b_1$. Thus any map of stochastic causal models
$\alpha : P \Rightarrow T_\nu$ with codomain $T_\nu$ must be defined objectwise by coarse grainings of
CGStoch, and hence itself be an epimorphism in $\mathbf{CGStoch}^{C}_{\mathrm{SSM}}$.

In particular, the unique morphism of stochastic causal models $\theta_{T_\nu} : I \Rightarrow T_\nu$ must
be an epimorphism. Since, by uniqueness, the diagram

[Triangle with legs $\theta_{T_\nu} : I \Rightarrow T_\nu$ (twice) and $\alpha : T_\nu \Rightarrow T_\nu$; that is, $\alpha \circ \theta_{T_\nu} = \theta_{T_\nu}$.]

must commute for any morphism of stochastic models $\alpha : T_\nu \Rightarrow T_\nu$, this implies
that the only such morphism is the identity map. But it is readily observed that
defining $\alpha_v = s : B \to B$ for each atomic variable v of C gives a monoidal natural
transformation $\alpha : T_\nu \Rightarrow T_\nu$ not equal to the identity. We thus have a contradiction,
and so $\mathbf{CGStoch}^{C}_{\mathrm{SSM}}$ has no initial object, as claimed. □

Example 5.5 (Two objects which have no product). We again work with $T_\nu$, where
$(B, \mathcal{P}(B), \nu)$ is the discrete measure space with two outcomes $\{b_1, b_2\}$ such that the
probability of each outcome is one half. We will see that the product of $T_\nu$ with itself
does not exist. Suppose to the contrary that a product X does exist, with projections
$\pi_1, \pi_2 : X \Rightarrow T_\nu$. We assume without loss of generality that each Xv is empirical;
recall that this means that each point of the set Xv is measurable.

Given the identity monoidal natural transformation $\mathrm{id} : T_\nu \Rightarrow T_\nu$, there exists a
unique monoidal natural transformation $\theta : T_\nu \Rightarrow X$ such that

[Product diagram: $\pi_1 \circ \theta = \mathrm{id}$ and $\pi_2 \circ \theta = \mathrm{id}$.]

commutes. This shows that for each atomic variable $v \in C$, Xv has an outcome $x_1$
of measure $\tfrac{1}{2}$ such that $\theta_v$ is induced by a function mapping $b_1$ to $x_1$, and $\pi_{1v}$ and
$\pi_{2v}$ are induced by functions mapping $x_1$ to $b_1$. Similarly, we also have $x_2 \in Xv$ of
measure $\tfrac{1}{2}$ such that $\theta_v$ is induced by a function mapping $b_2$ to $x_2$, and $\pi_{1v}$ and $\pi_{2v}$
are induced by functions mapping $x_2$ to $b_2$. Note that each Xv then has no other
outcomes of positive measure.

Let now $\alpha : T_\nu \Rightarrow T_\nu$ be the monoidal natural transformation of the previous proof
defined by $\alpha_v = s : B \to B$, where s is induced by the function sending $b_1$ to $b_2$ and
$b_2$ to $b_1$. Then there exists a unique monoidal natural transformation $\theta' : T_\nu \Rightarrow X$
such that

[Product diagram: $\pi_1 \circ \theta' = \alpha$ and $\pi_2 \circ \theta' = \mathrm{id}$.]

commutes. Now as $\nu(b_1) = \nu(b_2) = \tfrac{1}{2}$, for each atomic $v \in C$ the stochastic map
$\theta'_v : B \to Xv$ must then be induced either by the function mapping $b_1$ to $x_1$ and $b_2$ to
$x_2$, or by the function mapping $b_1$ to $x_2$ and $b_2$ to $x_1$. Both cases give a contradiction.
In the first case, the composite $(\pi_1 \circ \theta')_v$ is then equal to the identity map on B,
contradicting the definition of $\alpha$. In the second case, the composite $(\pi_2 \circ \theta')_v$ is equal
to s, and hence not equal to $\mathrm{id}_B$ as required.

Thus no product stochastic causal model $T_\nu \times T_\nu$ exists.

Example 5.6 (Two objects which have no coproduct). Let $T$ be the terminal object
of $\mathbf{CGStoch}^{\mathcal{C}}_{\mathrm{SSM}}$, and let $T_\lambda$ be the trivial model on the Lebesgue measure
$([0, 1], \mathcal{B}_{[0,1]}, \lambda)$ of the unit interval. We show that these two stochastic causal models
have no coproduct in $\mathbf{CGStoch}^{\mathcal{C}}_{\mathrm{SSM}}$. To this end, suppose that a coproduct $X$ does
exist, with injections $\iota_1 : T_\lambda \Rightarrow X$ and $\iota_2 : T \Rightarrow X$. We again assume without loss of
generality that each $Xv$ is empirical.

To show the difficulties in constructing a coproduct, we use the test object $T_\mu$
defined as the trivial model on the measure space $(B, \mathcal{P}(B), \mu)$ with $B = \{b_1, b_2\}$,
$\mu(b_1) = 1$ and $\mu(b_2) = 0$. Note that there is a unique map $\beta : T \Rightarrow T_\mu$; this is
induced on each atomic $v \in \mathcal{C}$ by the function sending the unique point $*$ of $Tv$ to
$b_1 \in B = T_\mu v$. This is the only such map as, since $*$ is a point of measure 1, its
image must be a point measure on a point of measure 1. Note also this implies that
for each $v \in \mathcal{C}$ the set $Xv$ contains a point $x_1$ such that a measurable subset of $Xv$
has $X[v]$-measure 1 if it contains $x_1$, and measure 0 otherwise.

Consider now maps $\alpha : T_\lambda \Rightarrow T_\mu$. These are defined by, for each atomic $v \in \mathcal{C}$,
a choice of a Lebesgue measure zero subset of $[0, 1]$. We then may let $\alpha_v : [0, 1] \to B$
be induced by the function mapping each element of this measure zero subset to $b_2$,
and the remaining elements to $b_1$. In particular, for each $p \in [0, 1]$, let $\alpha_p : T_\lambda \Rightarrow T_\mu$
be the monoidal natural transformation such that for all atomic $v \in \mathcal{C}$ the map
$(\alpha_p)_v : [0, 1] \to B$ is induced by the function mapping $p$ to $b_2$ and each element of
$[0, 1] \setminus \{p\}$ to $b_1$. By the universal property of the coproduct, for each such map there
exists a unique map $\theta : X \Rightarrow T_\mu$ such that the coproduct diagram commutes, that is,

\[ \theta \circ \iota_1 = \alpha_p, \qquad \theta \circ \iota_2 = \beta. \]

As $\theta_v$ must send $x_1$ to $b_1$ while $(\alpha_p)_v$ sends $p$ to $b_2$, this implies that for each
atomic $v$ the function inducing the deterministic stochastic map $\iota_{1v} : [0, 1] \to Xv$
does not map $p$ to $x_1$. Since $p \in [0, 1]$ was arbitrary, no point of $[0, 1]$ is mapped to $x_1$.
But this implies that the push-forward measure of $\lambda$ along $\iota_{1v}$ is the zero measure,
contradicting the commutativity of the diagram

[Diagram: $\iota_{1v} : [0, 1] \to Xv$.]

This shows that $T$ and $T_\lambda$ do not have a coproduct in $\mathbf{CGStoch}^{\mathcal{C}}_{\mathrm{SSM}}$.
Further Directions



In arriving at this point we have seen that causal theories provide a framework for
reasoning about causal relationships, with the morphisms of these categories
representing methods of inference, the topology of the string diagrams for these morphisms
providing an intuitive visualisation of information flow, and the stochastic models of
causal theories providing slight generalisations of Bayesian networks.

There are many directions in which this study could be continued. One obvious 
avenue for further exploration is to continue the work of the previous chapter in 
the characterisation of categories of stochastic causal models. This should, at the 
very least, provide additional insight into relationships between Bayesian networks. 
Although we have seen that products and coproducts do not exist in the category 
of stochastic causal models, and it is likely similar arguments show other types of 
limits and colimits do not exist, one suggestion is to examine ideas of families and 
moduli of stochastic causal models. For this, call Bayesian networks equivalent if 
there are measure-preserving measurable functions between their joint probability 
distributions that compose to the identity almost everywhere, and call two stochastic 
causal models equivalent if their induced Bayesian networks are equivalent. It may 
then be possible to put some geometric structure on the set of stochastic causal 
models, and subsequently define a moduli problem. This will perhaps generalise 
work on the algebraic geometry of Bayesian networks, such as that in [8]. One could
also explore the relationships between the categories of stochastic causal models of 
distinct causal theories. Here one might define a functor between such categories if 
there exists a map of directed graphs between their underlying causal structures. 

A weakness of causal theories is that their morphisms only describe predictive
inference: reasoning that infers information about consequences from their causes.
In general we are interested in other modes of inference too, and extension of the 
framework to allow discussion of these would make it much more powerful. In the 
probabilistic case, it can be shown that all conditionals of a joint distribution can be 
written as morphisms if one can also write Bayesian inverses of the causal conditionals. 
Given variables $w, w'$, these may be characterised as maps $k : w' \to w$ such that

[String diagram equation; wires labelled $w$ and $w'$.]

Methods for constructing such maps often run into issues of uniqueness on outcomes
of measure zero in the prior. While Coecke and Spekkens [1] give a method for
realising the Bayesian inverse of a finite stochastic map as transposition with respect
to a compactness structure in Mat when the prior is of full support, and Abramsky,
Blute, and Panangaden [1] give a category, similar to Stoch, in which Bayesian
inversion may be viewed as a dagger-functor, work remains to be done to merge these
ideas with those presented here.
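
For finite outcome sets, the uniqueness issue can be seen directly. The following is a minimal sketch, assuming that the characterising condition is the usual one: both ways of building the joint distribution on $w$ and $w'$ agree. The function `bayesian_inverse`, the outcome labels, and the uniform fallback are illustrative assumptions, not constructions from the text or from the cited works.

```python
from fractions import Fraction


def bayesian_inverse(prior, f):
    """Invert a finite stochastic map f : w -> w' against a prior on w.

    prior: dict mapping each outcome x of w to P(x).
    f:     dict mapping each x to a dict {y: P(y | x)} over outcomes y of w'.
    Returns (marginal, k), where marginal is the induced distribution on w'
    and k maps each y to a dict {x: P(x | y)}.
    """
    ys = sorted({y for row in f.values() for y in row})
    marginal = {
        y: sum(prior[x] * f[x].get(y, Fraction(0)) for x in prior) for y in ys
    }
    k = {}
    for y in ys:
        if marginal[y] == 0:
            # Bayes' rule says nothing here: any distribution on w satisfies
            # the defining condition. Uniform is an arbitrary choice.
            k[y] = {x: Fraction(1, len(prior)) for x in prior}
        else:
            k[y] = {
                x: prior[x] * f[x].get(y, Fraction(0)) / marginal[y]
                for x in prior
            }
    return marginal, k


# Hypothetical example data with a full-support prior.
prior = {"a0": Fraction(1, 4), "a1": Fraction(3, 4)}
f = {
    "a0": {"c0": Fraction(9, 10), "c1": Fraction(1, 10)},
    "a1": {"c0": Fraction(1, 5), "c1": Fraction(4, 5)},
}
marginal, k = bayesian_inverse(prior, f)

# Both factorisations of the joint distribution agree:
# P(x) P(y | x) == P(y) P(x | y) for all x, y.
joint_agrees = all(
    prior[x] * f[x][y] == marginal[y] * k[y][x] for x in prior for y in marginal
)
```

Whenever some outcome of $w'$ has marginal probability zero, the branch with the uniform fallback is taken, and any other choice there would satisfy the same condition; this is the non-uniqueness referred to above.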

Another topic deserving investigation is suggested by the fact that, although a 
joint probability distribution is compatible with a causal structure if it satisfies the 
required set of conditional independence relations, not every possible combination of 
conditional independence relations of a set of random variables can be represented by 
a causal structure. Indeed, the number of combinations of conditional independence 
relations grows exponentially in the number of atomic variables, while the number of
causal structures grows only quadratically. It is possible that the richer structure of 
categories may allow us to define causal theories more general than those arising from 
causal structures, such that models in some category are those that satisfy precisely 
a given set of conditional independencies, and no more. 

Finally, Sections 4.2 and 4.3 suggest their own further lines of investigation. While
we have focussed on models in Stoch and its subcategories, it would also be
worthwhile to understand more thoroughly models in Rel, and models in the category
Hilb of Hilbert spaces and linear maps may be interesting from the perspective of
quantum theory. It would also be interesting to find further examples of applications
of the graphical languages for causal theories. One option is to look at
representations of algorithms used on Bayesian networks, such as Gibbs sampling in Bayesian
networks [10].


